15
End-to-End Neural Entity Linking Nikolaos Kolitsas * ETH Zürich [email protected] Octavian-Eugen Ganea * ETH Zürich [email protected] Thomas Hofmann ETH Zürich [email protected] Abstract Entity Linking (EL) is an essential task for se- mantic text understanding and information ex- traction. Popular methods separately address the Mention Detection (MD) and Entity Dis- ambiguation (ED) stages of EL, without lever- aging their mutual dependency. We here pro- pose the first neural end-to-end EL system that jointly discovers and links entities in a text doc- ument. The main idea is to consider all possi- ble spans as potential mentions and learn con- textual similarity scores over their entity can- didates that are useful for both MD and ED decisions. Key components are context-aware mention embeddings, entity embeddings and a probabilistic mention - entity map, without demanding other engineered features. Empir- ically, we show that our end-to-end method significantly outperforms popular systems on the Gerbil platform when enough training data is available. Conversely, if testing datasets follow different annotation conventions com- pared to the training set (e.g. queries/ tweets vs news documents), our ED model coupled with a traditional NER system offers the best or second best EL accuracy. 1 Introduction and Motivation Towards the goal of automatic text understanding, machine learning models are expected to accurately extract potentially ambiguous mentions of enti- ties from a textual document and link them to a knowledge base (KB), e.g. Wikipedia or Freebase. Known as entity linking, this problem is an essen- tial building block for various Natural Language Processing tasks, e.g. automatic KB construction, question-answering, text summarization, or rela- tion extraction. An EL system typically performs two tasks: i) Mention Detection (MD) or Named Entity Recog- * Equal contribution. 1) MD may split a larger span into two mentions of less informative entities: B. Obama’s wife gave a speech [...] Fed erer’s coach [...] 2) MD may split a larger span into two mentions of incorrect entities: Obama Cas tle was built in 1601 in Japan. The Ken nel Club is UK’s official kennel club. A bird dog is a type of gun dog or hunting dog. Romeo and Juliet by Shakespeare [...] Natural killer cells are a type of lymphocyte Mary and Max, the 2009 movie [...] 3) MD may choose a shorter span, referring to an incorrect entity: The Ap ple is played again in cinemas. The New York Times is a popular newspaper. 4) MD may choose a longer span, referring to an incorrect entity: Babies Romeo and Juliet were born hours apart. Table 1: Examples where MD may benefit from ED and viceversa. Each wrong MD decision ( un der - lined) can be avoided by proper context understand- ing. The correct spans are shown in blue. nition (NER) when restricted to named entities – extracts entity references in a raw textual input, and ii) Entity Disambiguation (ED) – links these spans to their corresponding entities in a KB. Un- til recently, the common approach of popular sys- tems (Ceccarelli et al., 2013; van Erp et al., 2013; Piccinno and Ferragina, 2014; Daiber et al., 2013; Hoffart et al., 2011; Steinmetz and Sack, 2013) was to solve these two sub-problems independently. However, the important dependency between the two steps is ignored and errors caused by MD/NER will propagate to ED without possibility of recov- ery (Sil and Yates, 2013; Luo et al., 2015). We here advocate for models that address the end-to-end arXiv:1808.07699v2 [cs.CL] 29 Aug 2018

End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich [email protected] Octavian-Eugen Ganea ETH Zürich [email protected]

  • Upload
    others

  • View
    37

  • Download
    0

Embed Size (px)

Citation preview

Page 1: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

End-to-End Neural Entity Linking

Nikolaos Kolitsas ∗ETH Zürich

[email protected]

Octavian-Eugen Ganea ∗ETH Zürich

[email protected]

Thomas HofmannETH Zürich

[email protected]

Abstract

Entity Linking (EL) is an essential task for se-mantic text understanding and information ex-traction. Popular methods separately addressthe Mention Detection (MD) and Entity Dis-ambiguation (ED) stages of EL, without lever-aging their mutual dependency. We here pro-pose the first neural end-to-end EL system thatjointly discovers and links entities in a text doc-ument. The main idea is to consider all possi-ble spans as potential mentions and learn con-textual similarity scores over their entity can-didates that are useful for both MD and EDdecisions. Key components are context-awaremention embeddings, entity embeddings anda probabilistic mention - entity map, withoutdemanding other engineered features. Empir-ically, we show that our end-to-end methodsignificantly outperforms popular systems onthe Gerbil platform when enough training datais available. Conversely, if testing datasetsfollow different annotation conventions com-pared to the training set (e.g. queries/ tweetsvs news documents), our ED model coupledwith a traditional NER system offers the bestor second best EL accuracy.

1 Introduction and Motivation

Towards the goal of automatic text understanding,machine learning models are expected to accuratelyextract potentially ambiguous mentions of enti-ties from a textual document and link them to aknowledge base (KB), e.g. Wikipedia or Freebase.Known as entity linking, this problem is an essen-tial building block for various Natural LanguageProcessing tasks, e.g. automatic KB construction,question-answering, text summarization, or rela-tion extraction.

An EL system typically performs two tasks: i)Mention Detection (MD) or Named Entity Recog-

∗Equal contribution.

1) MD may split a larger span into two mentions ofless informative entities:

B. Obama’s wife gave a speech [...]Federer’s coach [...]2) MD may split a larger span into two mentions of

incorrect entities:Obama Castle was built in 1601 in Japan.The Kennel Club is UK’s official kennel club.A bird dog is a type of gun dog or hunting dog.Romeo and Juliet by Shakespeare [...]Natural killer cells are a type of lymphocyteMary and Max, the 2009 movie [...]3) MD may choose a shorter span,

referring to an incorrect entity:The Apple is played again in cinemas.The New York Times is a popular newspaper.4) MD may choose a longer span,

referring to an incorrect entity:Babies Romeo and Juliet were born hours apart.

Table 1: Examples where MD may benefit from EDand viceversa. Each wrong MD decision (under-lined) can be avoided by proper context understand-ing. The correct spans are shown in blue.

nition (NER) when restricted to named entities –extracts entity references in a raw textual input,and ii) Entity Disambiguation (ED) – links thesespans to their corresponding entities in a KB. Un-til recently, the common approach of popular sys-tems (Ceccarelli et al., 2013; van Erp et al., 2013;Piccinno and Ferragina, 2014; Daiber et al., 2013;Hoffart et al., 2011; Steinmetz and Sack, 2013)was to solve these two sub-problems independently.However, the important dependency between thetwo steps is ignored and errors caused by MD/NERwill propagate to ED without possibility of recov-ery (Sil and Yates, 2013; Luo et al., 2015). We hereadvocate for models that address the end-to-end

arX

iv:1

808.

0769

9v2

[cs

.CL

] 2

9 A

ug 2

018

Page 2: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

EL task, informally arguing that humans under-stand and generate text in a similar joint manner,discussing about entities which are gradually intro-duced, referenced under multiple names and evolv-ing during time (Ji et al., 2017). Further, we em-phasize the importance of the mutual dependencybetween MD and ED. First, numerous and moreinformative linkable spans found by MD obviouslyoffer more contextual cues for ED. Second, find-ing the true entities appearing in a specific contextencourages better mention boundaries, especiallyfor multi-word mentions. For example, in the firstsentence of Table 1, understanding the presenceof the entity Michelle Obama helps detectingits true mention "B. Obama’s wife", as opposedto separately linking B. Obama and wife to lessinformative concepts.

We propose a simple, yet competitive, model forend-to-end EL. Getting inspiration from the recentworks of (Lee et al., 2017) and (Ganea and Hof-mann, 2017), our model first generates all possiblespans (mentions) that have at least one possibleentity candidate. Then, each mention - candidatepair receives a context-aware compatibility scorebased on word and entity embeddings coupled witha neural attention and a global voting mechanisms.During training, we enforce the scores of goldentity - mention pairs to be higher than all possiblescores of incorrect candidates or invalid mentions,thus jointly taking the ED and MD decisions.

Our contributions are:

• We address the end-to-end EL task using asimple model that conditions the "linkable"quality of a mention to the strongest con-text support of its best entity candidate. Wedo not require expensive manually annotatednegative examples of non-linkable mentions.Moreover, we are able to train competitivemodels using little and only partially anno-tated documents (with named entities onlysuch as the CoNLL-AIDA dataset).

• We are among the first to show that, with onesingle exception, engineered features can befully replaced by neural embeddings automat-ically learned for the joint MD & ED task.

• On the Gerbil1 benchmarking platform, weempirically show significant gains for the end-to-end EL task when test and training data

1http://gerbil.aksw.org/gerbil/

come from the same domain. Morever, whentesting datasets follow different annotationschemes or exhibit different statistics, ourmethod is still effective in achieving state-of-the-art or close performance, but needs to becoupled with a popular NER system.

2 Related Work

With few exceptions, MD/NER and ED are treatedseparately in the vast EL literature.

Traditional NER models usually view the prob-lem as a word sequence labeling that is modeled us-ing conditional random fields on top of engineeredfeatures (Finkel et al., 2005) or, more recently, us-ing bi-LSTMs architectures (Lample et al., 2016;Chiu and Nichols, 2016; Liu et al., 2017) capableof learning complex lexical and syntactic features.

In the context of ED, recent neural methods (Heet al., 2013; Sun et al., 2015; Yamada et al., 2016;Ganea and Hofmann, 2017; Le and Titov, 2018;Yang et al., 2018; Radhakrishnan et al., 2018) haveestablished state-of-the-art results, outperformingengineered features based models. Context awareword, span and entity embeddings, together withneural similarity functions, are essential in theseframeworks.

End-to-end EL is the realistic task and ultimategoal, but challenges in joint NER/MD and ED mod-eling arise from their different nature. Few previousmethods tackle the joint task, where errors in onestage can be recovered by the next stage. One ofthe first attempts, (Sil and Yates, 2013) use a popu-lar NER model to over-generate mentions and letthe linking step to take the final decisions. How-ever, their method is limited by the dependence ona good mention spotter and by the usage of hand-engineered features. It is also unclear how linkingcan improve their MD phase. Later, (Luo et al.,2015) presented one of the most competitive jointMD and ED models leveraging semi-ConditionalRandom Fields (semi-CRF). However, there areseveral weaknesses in this work. First, the mutualtask dependency is weak, being captured only bytype-category correlation features. The other engi-neered features used in their model are either NERor ED specific. Second, while their probabilisticgraphical model allows for tractable learning andinference, it suffers from high computational com-plexity caused by the usage of the cartesian prod-uct of all possible document span segmentations,NER categories and entity assignments. Another

Page 3: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

approach is J-NERD (Nguyen et al., 2016) that ad-dresses the end-to-end task using only engineeredfeatures and a probabilistic graphical model on topof sentence parse trees.

3 Neural Joint Mention Detection andEntity Disambiguation

We formally introduce the tasks of interest. ForEL, the input is a text document (or a query ortweet) given as a sequence D = {w1, . . . , wn} ofwords from a dictionary, wk ∈ W . The outputof an EL model is a list of mention - entity pairs{(mi, ei)}i∈1,T , where each mention is a word sub-sequence of the input document, m = wq, . . . , wr,and each entity is an entry in a knowledge baseKB (e.g. Wikipedia), e ∈ E . For the ED task,the list of entity mentions {mi}i=1,T that need tobe disambiguated is additionally provided as in-put. The expected output is a list of correspondingannotations {ei}i=1,T ∈ ET .

Note that, in this work, we only link mentionsthat have a valid gold KB entity, setting referredin (Röder et al., 2017) as InKB evaluation. Thus,we treat mentions referring to entities outside ofthe KB as "non-linkable". This is in line with fewprevious models, e.g. (Luo et al., 2015; Ganea andHofmann, 2017; Yamada et al., 2016). We leavethe interesting setting of discovering out-of-KBentities as future work.

We now describe the components of our neuralend-to-end EL model, depicted in Figure 1. Weaim for simplicity, but competitive accuracy.

Word and Char Embeddings. We use pre-trained Word2Vec vectors (Mikolov et al., 2013).In addition, we train character embeddings thatcapture important word lexical information. Fol-lowing (Lample et al., 2016), for each word inde-pendently, we use bidirectional LSTMs (Hochreiterand Schmidhuber, 1997) on top of learnable charembeddings. These character LSTMs do not ex-tend beyond single word boundaries, but they sharethe same parameters. Formally, let {z1, . . . , zL} bethe character vectors of word w. We use the for-ward and backward LSTMs formulations definedrecursively as in (Lample et al., 2016):

hft = FWD − LSTM(hft−1, zt) (1)

hbt = BKWD − LSTM(hbt+1, zt)

Then, we form the character embedding of w is[hfL;hb1] from the hidden state of the forward LSTM

corresponding to the last character concatenatedwith the hidden state of the backward LSTM cor-responding to the first character. This is then con-catenated with the pre-trained word embedding,forming the context-independent word-characterembedding of w. We denote the sequence of thesevectors as {vk}k∈1,n and depict it as the first neurallayer in Figure 1.

Mention Representation. We find it crucial tomake word embeddings aware of their local context,thus being informative for both mention boundarydetection and entity disambiguation (leveragingcontextual cues, e.g. "newspaper"). We thus en-code context information into words using a bi-LSTM layer on top of the word-character embed-dings {vk}k∈1,n. The hidden states of forward andbackward LSTMs corresponding to each word arethen concatenated into context-aware word embed-dings, whose sequence is denoted as {xk}k∈1,n.

Next, for each possible mention, we produce afixed size representation inspired by (Lee et al.,2017). Given a mention m = wq, . . . , wr, we firstconcatenate the embeddings of the first, last andthe "soft head" words of the mention:

gm = [xq;xr; x̂m] (2)

The soft head embedding x̂m is built using an at-tention mechanism on top of the mention’s wordembeddings, similar with (Lee et al., 2017):

αk = 〈wα, xk〉

amk =exp(αk)∑rt=q exp(αt)

(3)

x̂m =r∑

k=q

amk · vk

However, we found the soft head embedding toonly marginally improve results, probably due tothe fact that most mentions are at most 2 wordslong. To learn non-linear interactions between thecomponent word vectors, we project gm to a finalmention representation with the same size as en-tity embeddings (see below) using a shallow feed-forward neural network FFNN (a simple projectionlayer):

xm = FFNN1(gm) (4)

Entity Embeddings. We use fixed continuousentity representations, namely the pre-trained en-tity embeddings of (Ganea and Hofmann, 2017),

Page 4: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

Figure 1: Our global model architecture shown for the mention The New York Times. The final score isused for both the mention linking and entity disambiguation decisions.

due to their simplicity and compatibility with thepre-trained word vectors of (Mikolov et al., 2013).Briefly, these vectors are computed for each en-tity in isolation using the following exponentialmodel that approximates the empirical conditionalword-entity distribution p̂(w|e) obtained from co-occurrence counts.

exp(〈xw, ye〉)∑w′∈W exp(〈xw′ , ye〉)

≈ p̂(w|e) (5)

Here, xw are fixed pre-trained word vectors andye is the entity embedding to be trained. In prac-tice, (Ganea and Hofmann, 2017) re-write this as amax-margin objective loss.

Candidate Selection. For each spanm we selectup to s entity candidates that might be referred bythis mention. These are top entities based on anempirical probabilistic entity - map p(e|m) builtby (Ganea and Hofmann, 2017) from Wikipediahyperlinks, Crosswikis (Spitkovsky and Chang)and YAGO (Hoffart et al., 2011) dictionaries. Wedenote by C(m) this candidate set and use it bothat training and test time.

Final Local Score. For each span m that canpossibly refer to an entity (i.e. |C(m)| ≥ 1) andfor each of its entity candidates ej ∈ C(m), wecompute a similarity score using embedding dot-product that supposedly should capture useful in-formation for both MD and ED decisions. We thencombine it with the log-prior probability using ashallow FFNN, giving the context-aware entity -mention score:

Ψ(ej ,m) = FFNN2([log p(ej |m); 〈xm, yj〉])(6)

Long Range Context Attention. In some cases,our model might be improved by explicitly cap-turing long context dependencies. To test this, weexperimented with the attention model of (Ganeaand Hofmann, 2017). This gives one context em-bedding per mention based on informative contextwords that are related to at least one of the candi-date entities. We use this additional context embed-ding for computing dot-product similarity with anyof the candidate entity embeddings. This value isfed as additional input of FFNN2 in Eq. 6. We referto this model as long range context attention.

Page 5: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

Training. We assume a corpus with docu-ments and gold entity - mention pairs G ={(mi, e

∗i )}i=1,K is available. At training time, for

each input document we collect the set M of all(potentially overlapping) token spans m for which|C(m)| ≥ 1. We then train the parameters of ourmodel using the following minimization procedure:

θ∗ = arg minθ

∑m∈M

∑e∈C(m)

V (Ψθ(e,m)) (7)

where the violation term V enforces the scores ofgold pairs to be linearly separable from scores ofnegative pairs, i.e.

V (Ψ(e,m)) = (8){max(0, γ −Ψ(e,m)), if(e,m) ∈ Gmax(0,Ψ(e,m)), otherwise

Note that, in the absence of annotated negativeexamples of "non-linkable" mentions, we assumethat all spans in M and their candidates that do notappear in G should not be linked. The model willbe enforced to only output negative scores for allentity candidates of such mentions.

We call all spans training the above setting. Ourmethod can also be used to perform ED only, inwhich case we train only on gold mentions, i.e.M = {m|m ∈ G}. This is referred as gold spanstraining.

Inference. At test time, our method can be ap-plied for both EL and ED only as follows. First,for each document in our validation or test sets,we select all possibly linkable token spans, i.e.M = {m| |C(m)| ≥ 1} for EL, or the input set ofmentions M = {m|m ∈ G} for ED, respectively.Second, the best linking threshold δ is found onthe validation set such that the micro F1 metricis maximized when only linking mention - entitypairs with Ψ score greater than δ. At test time, onlyentity - mention pairs with a score higher than δare kept and sorted according to their Ψ scores;the final annotations are greedily produced basedon this set such that only spans not overlappingwith previously selected spans (of higher scores)are chosen.

Global Disambiguation Our current model is"local", i.e. performs disambiguation of each can-didate span independently. To enhance it, we addan extra layer to our neural network that will pro-mote coherence among linked and disambiguated

entities inside the same document, i.e. the globaldisambiguation layer. Specifically, we compute a"global" mention-entity score based on which weproduce the final annotations. We first define the setof mention-entity pairs that are allowed to partici-pate in the global disambiguation voting, namelythose that already have a high local score:

VG = {(m, e)|m ∈M, e ∈ C(m),Ψ(e,m) ≥ γ′}

Since we initially consider all possible spans andfor each span up to s candidate entities, this filteringstep is important to avoid both undesired noiseand exponential complexity for the EL task forwhichM is typically much bigger than for ED. Thefinal "global" score G(ej ,m) for entity candidateej of mention m is given by the cosine similaritybetween the entity embedding and the normalizedaverage of all other voting entities’ embeddings (ofthe other mentions m′).

V mG = {e|(m′, e) ∈ VG ∧m′ 6= m}

ymG =∑e∈Vm

G

ye

G(ej ,m) = cos(yej , ymG )

This is combined with the local score, yielding

Φ(ej ,m) = FFNN3([Ψ(ej ,m);G(ej ,m)])

The final loss function is now slightly modified.Specifically, we enforce the linear separability intwo places: in Ψ(e,m) (exactly as before), but alsoin Φ(e,m), as follows

θ∗ = arg minθ

(9)∑d∈D

∑m∈M

∑e∈C(m)

V (Ψθ(e,m)) + V (Φθ(e,m))

The inference procedure remains unchanged inthis case, with the exception that it will only usethe Φ(e,m) global score.

Coreference Resolution Heuristic. In a fewcases we found important to be able to solve simplecoreference resolution cases (e.g. "Alan" referringto "Alan Shearer"). These cases are difficult to han-dle by our candidate selection strategy. We thusadopt the simple heuristic descried in (Ganea andHofmann, 2017) and observed between 0.5% and1% improvement on all datasets.

Page 6: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

F1@MAF1@MI

AID

AA

AID

AB

MSN

BC

OK

E-2

015

OK

E-2

016

N3-

Reu

ters

-128

N3-

RSS

-500

Der

czyn

ski

KO

RE

50

FREME 23.637.6

23.836.3

15.819.9

26.131.6

22.728.5

26.830.9

32.527.8

31.418.9

12.314.5

FOX 54.758.0

58.157.0

11.28.3

53.956.8

49.550.5

52.453.3

35.133.8

42.038.0

28.330.8

Babelfy 41.247.2

42.448.5

36.639.7

39.341.9

37.837.7

19.623.0

32.129.1

28.929.8

52.555.9

Entityclassifier.eu 43.044.7

42.945.0

41.442.2

29.229.5

33.832.5

24.727.9

23.122.7

16.316.9

25.228.0

Kea 36.840.4

39.042.3

30.630.9

44.646.2

46.346.4

17.518.1

22.720.5

31.326.5

41.046.8

DBpedia Spotlight 49.955.2

52.057.8

42.440.6

42.044.4

41.443.1

21.524.8

26.727.2

33.732.2

29.434.9

AIDA 68.872.4

71.972.8

62.765.1

58.763.1

0.00.0

42.646.4

42.642.4

40.632.6

49.655.4

WAT 69.272.8

70.873.0

62.664.5

53.256.4

51.853.9

45.049.2

45.342.3

44.438.0

37.349.6

Best baseline 69.272.8

71.973.0

62.765.1

58.763.1

51.853.9

52.453.3

45.342.4

44.438.0

52.555.9

base model 86.689.1

81.180.5

64.565.7

54.358.2

43.646.0

47.749.0

44.238.8

43.538.1

34.942.0

base model + att 86.588.9

81.982.3

69.469.5

56.660.7

49.251.6

48.351.1

46.040.5

47.942.3

36.042.2

base model + att + global 86.689.4

82.682.4

73.072.4

56.661.9

47.852.7

45.450.3

43.838.2

43.234.1

26.235.2

ED base model + att + global usingStanford NER mentions 75.7

80.373.374.6

71.171.0

62.966.9

57.158.4

54.254.6

45.942.2

48.842.3

40.346.0

Table 2: EL strong matching results on the Gerbil platform. Micro and Macro F1 scores are shown. Wehighlight the best and second best models, respectively. Training was done on AIDA-train set.

4 Experiments

Datasets and Metrics. We used Wikipedia 2014as our KB. We conducted experiments on the mostimportant public EL datasets using the Gerbil plat-form (Röder et al., 2017). Datasets’ statistics areprovided in Tables 5 and 6 (Appendix). This bench-marking framework offers reliable and trustableevaluation and comparison with state of the artEL/ED methods on most of the public datasets forthis task. It also shows how well different systemsgeneralize to datasets from very different domainsand annotation schemes compared to their trainingsets. Moreover, it offers evaluation metrics for theend-to-end EL task, as opposed to some works thatonly evaluate NER and ED separately, e.g. (Luoet al., 2015).

As previously explained, we do not use the NILmentions (without a valid KB entity) and only com-pare against other systems using the InKB metrics.

For training we used the biggest publicly avail-able EL dataset, AIDA/CoNLL (Hoffart et al.,2011), consisting of a training set of 18,448 linkedmentions in 946 documents, a validation set of4,791 mentions in 216 documents, and a test set of4,485 mentions in 231 documents.

We report micro and macro InKB F1 scores for

both EL and ED. For EL, these metrics are com-puted both in the strong matching and weak match-ing settings. The former requires exactly predictingthe gold mention boundaries and their entity anno-tations, whereas the latter gives a perfect score tospans that just overlap with the gold mentions andare linked to the correct gold entities.

Baselines. We compare with popular and state-of-the-art EL and ED public systems, e.g.WAT (Piccinno and Ferragina, 2014), DbpediaSpotlight (Mendes et al., 2011), NERD-ML (vanErp et al., 2013), KEA (Steinmetz and Sack, 2013),AIDA (Hoffart et al., 2011), FREME, FOX (Speckand Ngomo, 2014), PBOH (Ganea et al., 2016) orBabelfy (Moro et al., 2014). These are integratedwithin the Gerbil platform. However, we did notcompare with (Luo et al., 2015) and other mod-els outside Gerbil that do not use end-to-end ELmetrics.

Training Details and Model Hyper-Parameters.We use the same model hyper-parameters for bothEL and ED training. The only difference betweenthe two settings is the span set M : EL uses allspans training, whereas ED uses gold spans train-ing settings as explained before.

Our pre-trained word and entity embeddings are

Page 7: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

solved%matched gold entity%

position of groundtruth in p(e|m)

Num

bero

fm

entio

ns

ED

Glo

balm

odel

ED

Bas

e+

Att

mod

el

ED

Bas

em

odel

EL

Glo

balm

odel

EL

Bas

e+

Att

mod

el

EL

Bas

em

odel

EL

Glo

bal-

logp

(e|m

)

1 3390 98.898.8

98.698.9

98.398.8

96.799.0

96.698.6

96.299.0

93.396.7

2 655 89.989.9

88.188.5

88.588.9

86.890.8

86.890.8

85.088.4

86.989.8

3 108 83.384.3

81.582.4

75.978.7

79.184.5

80.284.7

74.881.1

84.388.9

4-8 262 78.278.2

76.379.0

74.876.0

69.578.2

68.878.7

68.776.0

78.983.5

9+ 247 59.963.6

53.460.7

53.058.7

47.858.2

46.259.4

50.454.8

62.767.5

Table 3: AIDA A dataset: Gold mentions are split by the position they appear in the p(e|m) dictionary.In each cell, the upper value is the percentage of the gold mentions that were annotated with the correctentity (recall), whereas the lower value is the percentage of gold mentions for which our system’s highestscored entity is the ground truth entity, but that might not be annotated in the end because its score isbelow the threshold δ.

300 dimensional, while 50 dimensional trainablecharacter vectors were used. The char LSTMs havealso hidden dimension of 50. Thus, word-characterembeddings are 400 dimensional. The contextualLSTMs have hidden size of 150, resulting in 300dimensional context-aware word vectors. We applydropout on the concatenated word-character em-beddings, on the output of the bidirectional contextLSTM and on the entity embeddings used in Eq. 6.The three FFNNs in our model are simple projec-tions without hidden layers (no improvements wereobtained with deeper layers). For the long rangecontext attention we used a word window size of K= 200 and keep top R = 10 words after the hard at-tention layer (notations from (Ganea and Hofmann,2017)). We use at most s = 30 entity candidate permention both at train and test time. γ is set to 0.2without further investigations. γ′ is set to 0, but avalue of 0.1 was giving similar results.

For the loss optimization we use Adam (Kingmaand Ba, 2014) with a learning rate of 0.001. Weperform early stopping by evaluating the modelon the AIDA validation set each 10 minutes andstopping after 6 consecutive evaluations with nosignificant improvement in the macro F1 score.

Results and Discussion. The following modelsare used.

i) Base model: only uses the mention local scoreand the log-prior. It does not use long range atten-tion, nor global disambiguation. It does not use thehead attention mechanism.

ii) Base model + att: the Base Model plus LongRange Context Attention.

ii) Base model + att + global: our Global Model(depicted in figure 1)

iv) ED base model + att + global StanfordNER: our ED Global model that runs on top ofthe detected mentions of the Stanford NER sys-tem (Finkel et al., 2005).

EL strong and weak matching results are pre-sented in Tables 2 and 7 (Appendix).

We first note that our system outperforms allbaselines on the end-to-end EL task on both AIDA-A (dev) and AIDA-B (test) datasets, which are thebiggest EL datasets publicly available. Moreover,we surpass all competitors on both EL and ED by alarge margin, at least 9%, showcasing the effective-ness of our method. We also outperform systemsthat optimize MD and ED separately, including ourED base model + att + global Stanford NER. Thisdemonstrates the merit of joint MD + ED optimiza-tion.

In addition, one can observe that weak matchingEL results are comparable with the strong matchingresults, showcasing that our method is very goodat detecting mention boundaries.

At this point, our main goal was achieved: ifenough training data is available with the samecharacteristics or annotation schemes as the testdata, then our joint EL offers the best model. Thisis true not only when training on AIDA, but also forother types of datasets such as queries (Table 11)or tweets (Table 12). However, when testing data

Page 8: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

1) Annotated document:SOCCER - [SHEARER ] NAMED AS [1 [1ENGLAND] CAPTAIN]. [ LONDON ] 1996-08-30 Theworld ’s costliest footballer [ Alan Shearer ] was named as the new [England ] captain on Friday. The26-year-old, who joined [Newcastle] for 15 million pounds sterling, takes over from [Tony Adams],who led the side during the [ European ] championship in June, and former captain [ David Platt ] .[2Adams] and [ Platt ] are both injured and will miss [England]’s opening [3[World Cup]] qualifieragainst [Moldova] on Sunday . [Shearer] takes the captaincy on a trial basis , but new coach [GlennHoddle] said he saw no reason why the former [Blackburn] and [Southampton] skipper should notmake the post his own . "I ’m sure [4Alan] is the man for the job , " [Hoddle] said . [...] I spoke to[5Alan] he was up for it [...]. [Shearer] ’s [Euro 96] striking partner [...].Analysis:[1 is considered a false negative and the ground truth is the entity England_national_football_team. Ourannotation was for the span "ENGLAND CAPTAIN" and wrongly linked to the England_cricket_team.[3 The ground truth here is 1998_FIFA_World_Cup whereas our model links it to FIFA_World_Cup.[2,4,5 are correctly solved due to our coreference resolution heuristic.2) Annotated document:[ N. Korea ] urges [ S. Korea ] to return war veteran . [ SEOUL ] 1996-08-31 [ North Korea ] demandedon Saturday that [ South Korea ] return a northern war veteran who has been in the [1 South ] sincethe 1950-53 war , [ Seoul ] ’s unification ministry said . " ...I request the immediate repatriation ofKim In-so to [ North Korea ] where his family is waiting , " [1 North Korean ] Red Cross president LiSong-ho said in a telephone message to his southern couterpart , [2 Kang Young-hoon ] . Li said Kimhad been critically ill with a cerebral haemorrhage . The message was distributed to the press by the [South Korean ] unification ministry . Kim , an unrepentant communist , was captured during the [2[3 Korean ] War ] and released after spending more than 30 years in a southern jail . He submitted apetition to the [ International Red Cross ] in 1993 asking for his repatriation . The domestic [ Yonhap ]news agency said the [ South Korean ] government would consider the northern demand only if the[3 North ] accepted [ Seoul ] ’s requests , which include regular reunions of families split by the [4[4 Korean ] War ] . Government officials were not available to comment . [ South Korea ] in 1993unconditionally repatriated Li In-mo , a nothern partisan seized by the [5 South ] during the war andjailed for more than three decades.Analysis:[1, [3, and [5 cases illustrate the main source of errors. These are false negatives in which our modelhas the correct ground truth entity pair as the highest scored one for that mention, but since it is notconfident enough (score < γ) it decides not to annotate that mention. In this specific document theseerrors could probably be avoided easily with a better coreference resolution mechanism.[3 and [4 cases illustrate that the gold standard can be problematic. Specifically, instead of annotatingthe whole span Korean War and linking it to the war of 1950, the gold annotation only include Koreanand link it to the general entity of Korea_(country).[2 is correctly annotated by our system but it is not included in the gold standard.

Table 4: Error analysis on a sample document. Green corresponds to true positive (correctly discoveredand annotated mention), red to false negative (ground truth mention or entity that was not annotated) andorange to false positive (incorrect mention or entity annotation ).

has different statistics or follows different conven-tions than the training data, our method is shownto work best in conjunction with a state-of-the-artNER system as it can be seen from the results of ourED base model + att + global Stanford NER fordifferent datasets in Table 2. It is expected that sucha NER system designed with a broader generaliza-

tion scheme in mind would help in this case, whichis confirmed by our results on different datasets.

While the main focus of this paper is the end-to-end EL and not the ED-only task, we do showED results in Tables 8 and 9. We observe that ourmodels are slightly behind recent top performingsystems, but our unified EL - ED architecture has

Page 9: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

to deal with other challenges, e.g. being able toexchange global information between many morementions at the EL stage, and is thus not suitable forexpensive global ED strategies. We leave bridgingthis gap for ED as future work.

Additional results and insights are shown in theAppendix.

Ablation study Table 3 shows an ablation studyof our method. One can see that the log p(e|m)prior is very helpful for correctly linking unam-biguous mentions, but is introducing noise whengold entities are not frequent. For this category ofrare entities, removing this prior completely willresult in a significant improvement, but this is nota practical choice since the gold entity is unknownat test time.

Error Analysis. We conducted a qualitative ex-periment shown in Table 4. We showcase correctannotations, as well as errors done by our systemon the AIDA datasets. Inspecting the output anno-tations of our EL model, we discovered the remark-able property of not over-generating incorrect men-tions, nor under-generating (missing) gold spans.We also observed that additional mentions gener-ated by our model do correspond in the majorityof time to actual KB entities, but are incorrectlyforgotten from the gold annotations.

5 Conclusion

We presented the first neural end-to-end entity link-ing model and show the benefit of jointly optimiz-ing entity recognition and linking. Leveraging keycomponents, namely word, entity and mention em-beddings, we prove that engineered features canbe almost completely replaced by modern neuralnetworks. Empirically, on the established Gerbilbenchmarking platform, we exhibit state-of-the-artperformance for EL on the biggest public dataset,AIDA/CoNLL, also showing good generalizationability on other datasets with very different char-acteristics when combining our model with thepopular Stanford NER system.

Our code is publicly available2.

ReferencesDavid Carmel, Ming-Wei Chang, Evgeniy Gabrilovich,

Bo-June Paul Hsu, and Kuansan Wang. 2014.

2https://github.com/dalab/end2end_neural_el

Erd’14: entity recognition and disambiguation chal-lenge. In ACM SIGIR Forum, volume 48, pages 63–77. ACM.

Diego Ceccarelli, Claudio Lucchese, Salvatore Or-lando, Raffaele Perego, and Salvatore Trani. 2013.Dexter: an open source framework for entity linking.In Proceedings of the sixth international workshopon Exploiting semantic annotations in informationretrieval, pages 17–20. ACM.

Jason PC Chiu and Eric Nichols. 2016. Named entityrecognition with bidirectional lstm-cnns. Transac-tions of the Association for Computational Linguis-tics, 4:357–370.

Marco Cornolti, Paolo Ferragina, Massimiliano Cia-ramita, Stefan Rüd, and Hinrich Schütze. 2016. Apiggyback system for joint entity mention detectionand linking in web queries. In Proceedings of the25th International Conference on World Wide Web,pages 567–578. International World Wide Web Con-ferences Steering Committee.

Joachim Daiber, Max Jakob, Chris Hokamp, andPablo N Mendes. 2013. Improving efficiency andaccuracy in multilingual entity extraction. In Pro-ceedings of the 9th International Conference on Se-mantic Systems, pages 121–124. ACM.

Leon Derczynski, Diana Maynard, Giuseppe Rizzo,Marieke van Erp, Genevieve Gorrell, RaphaëlTroncy, Johann Petrak, and Kalina Bontcheva. 2015.Analysis of named entity recognition and linkingfor tweets. Information Processing & Management,51(2):32–49.

MGJ van Erp, G Rizzo, and R Troncy. 2013. Learn-ing with the web: Spotting named entities on the in-tersection of nerd and machine learning. In CEURworkshop proceedings, pages 27–30.

Jenny Rose Finkel, Trond Grenager, and ChristopherManning. 2005. Incorporating non-local informa-tion into information extraction systems by gibbssampling. In Proceedings of the 43rd annual meet-ing on association for computational linguistics,pages 363–370. Association for Computational Lin-guistics.

Octavian-Eugen Ganea, Marina Ganea, Aurelien Luc-chi, Carsten Eickhoff, and Thomas Hofmann. 2016.Probabilistic bag-of-hyperlinks model for entitylinking. In Proceedings of the 25th InternationalConference on World Wide Web, pages 927–938. In-ternational World Wide Web Conferences SteeringCommittee.

Octavian-Eugen Ganea and Thomas Hofmann. 2017.Deep joint entity disambiguation with local neuralattention. In Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processing,pages 2619–2629.

Page 10: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, LongkaiZhang, and Houfeng Wang. 2013. Learning entityrepresentation for entity disambiguation. In Pro-ceedings of the 51st Annual Meeting of the Associa-tion for Computational Linguistics (Volume 2: ShortPapers), volume 2, pages 30–34.

Sepp Hochreiter and Jürgen Schmidhuber. 1997.Long short-term memory. Neural computation,9(8):1735–1780.

Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen,Martin Theobald, and Gerhard Weikum. 2012. Kore:keyphrase overlap relatedness for entity disambigua-tion. In Proceedings of the 21st ACM internationalconference on Information and knowledge manage-ment, pages 545–554. ACM.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bor-dino, Hagen Fürstenau, Manfred Pinkal, Marc Span-iol, Bilyana Taneva, Stefan Thater, and GerhardWeikum. 2011. Robust disambiguation of namedentities in text. In Proceedings of the Conference onEmpirical Methods in Natural Language Processing,pages 782–792. Association for Computational Lin-guistics.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, YejinChoi, and Noah A Smith. 2017. Dynamic entity rep-resentations in neural language models. In Proceed-ings of the 2017 Conference on Empirical Methodsin Natural Language Processing, pages 1830–1839.

Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

Guillaume Lample, Miguel Ballesteros, Sandeep Sub-ramanian, Kazuya Kawakami, and Chris Dyer. 2016.Neural architectures for named entity recognition.In Proceedings of NAACL-HLT, pages 260–270.

Phong Le and Ivan Titov. 2018. Improving entity link-ing by modeling latent relations between mentions.arXiv preprint arXiv:1804.10637.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettle-moyer. 2017. End-to-end neural coreference reso-lution. In Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processing,pages 188–197.

Liyuan Liu, Jingbo Shang, Frank Xu, Xiang Ren, HuanGui, Jian Peng, and Jiawei Han. 2017. Empowersequence labeling with task-aware neural languagemodel. arXiv preprint arXiv:1709.04109.

Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Za-iqing Nie. 2015. Joint entity recognition and disam-biguation. In Proceedings of the 2015 Conferenceon Empirical Methods in Natural Language Process-ing, pages 879–888.

Pablo N Mendes, Max Jakob, Andrés García-Silva, andChristian Bizer. 2011. Dbpedia spotlight: sheddinglight on the web of documents. In Proceedings of

the 7th international conference on semantic sys-tems, pages 1–8. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-rado, and Jeff Dean. 2013. Distributed representa-tions of words and phrases and their compositional-ity. In Advances in neural information processingsystems, pages 3111–3119.

Andrea Moro, Alessandro Raganato, and Roberto Nav-igli. 2014. Entity linking meets word sense disam-biguation: a unified approach. Transactions of theAssociation for Computational Linguistics, 2:231–244.

Dat Ba Nguyen, Martin Theobald, and GerhardWeikum. 2016. J-nerd: joint named entity recogni-tion and disambiguation with rich linguistic features.Transactions of the Association for ComputationalLinguistics, 4:215–229.

Andrea Giovanni Nuzzolese, Anna Lisa Gentile,Valentina Presutti, Aldo Gangemi, Darío Garigliotti,and Roberto Navigli. 2015. Open knowledge extrac-tion challenge. In Semantic Web Evaluation Chal-lenge, pages 3–15. Springer.

Francesco Piccinno and Paolo Ferragina. 2014. Fromtagme to wat: a new entity annotator. In Proceed-ings of the first international workshop on Entityrecognition & disambiguation, pages 55–62. ACM.

Priya Radhakrishnan, Partha Talukdar, and VasudevaVarma. 2018. Elden: Improved entity linking us-ing densified knowledge graphs. In Proceedings ofthe 2018 Conference of the North American Chap-ter of the Association for Computational Linguistics:Human Language Technologies, Volume 1 (Long Pa-pers), volume 1, pages 1844–1853.

Giuseppe Rizzo, Marieke van Erp, and Raphaël Troncy.Benchmarking the extraction and disambiguation ofnamed entities on the semantic web.

Michael Röder, Ricardo Usbeck, and Axel-CyrilleNgonga Ngomo. 2017. Gerbil–benchmarkingnamed entity recognition and linking consistently.Semantic Web, (Preprint):1–21.

Avirup Sil and Alexander Yates. 2013. Re-ranking forjoint named-entity recognition and linking. In Pro-ceedings of the 22nd ACM international conferenceon Conference on information & knowledge manage-ment, pages 2369–2374. ACM.

René Speck and Axel-Cyrille Ngonga Ngomo. 2014.Ensemble learning for named entity recognition. InInternational semantic web conference, pages 519–534. Springer.

Valentin I Spitkovsky and Angel X Chang. A cross-lingual dictionary for english wikipedia concepts.

Nadine Steinmetz and Harald Sack. 2013. Semanticmultimedia information retrieval based on contex-tual descriptions. In Extended Semantic Web Con-ference, pages 382–396. Springer.

Page 11: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhen-zhou Ji, and Xiaolong Wang. 2015. Modeling men-tion, context and entity with neural networks for en-tity disambiguation. In Twenty-Fourth InternationalJoint Conference on Artificial Intelligence.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, andYoshiyasu Takefuji. 2016. Joint learning of the em-bedding of words and entities for named entity dis-ambiguation. In Proceedings of The 20th SIGNLLConference on Computational Natural LanguageLearning, pages 250–259.

Yi Yang, Ozan Irsoy, and Kazi Shefaet Rahman. 2018.Collective entity disambiguation with structured gra-dient tree boosting. In Proceedings of the 2018 Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers), vol-ume 1, pages 777–786.

3https://github.com/dice-group/gerbil/issues/98

Page 12: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

position of groundtruth in p(e|m)

AIDA Train AIDA A (dev) AIDA B (test)

11370475.6%

339072.7%

307869.8%

22339

12.9%65514%

65114.8%

3565

3.1%108

2.3%184

4.2%

4-8843

4.6%262

5.6%3087%

9+686

3.8%247

5.3%187

4.2%Total number of non-NIL mentions 18137 4662 4408

Unambiguous mentions, i.e.for which ground truth is

the unique candidate

357519.7%

93620.1%

83318.9%

Table 5: Statistics on the number of mentions for various settings and for all AIDA datasets.

Dat

aset

Topi

c

Type

ofan

nota

tions

Exh

aust

ive

anno

tatio

ns

Num

Doc

s

Avg.

Wor

ds/D

oc.

Num

InK

BA

nns

Num

ofgo

lden

titie

sno

tin

our

KB

Rec

all3

0(%

)

Rec

all1

0(%

)

AIDA news NE 3 231 190 4483 5 98.3 95.7AQUAINT news NE&N 7 50 221 727 16 93.5 92.6MSNBC news NE 3 20 544 653 3 92.8 90.3ACE2004 news NE 7 57 374 257 2 87.9 86.8

Derczynski(Derczynski et al., 2015)

tweets NE 3 182 21 210 32 71.0 69.5

Microposts2016 dev(Rizzo et al.)

tweets NE 3 100 14 251 50 59.8 59.8

N3-Reuters-128 news NE 3 128 124 631 7 68.9 64.3N3-RSS-500 news NE 7 500 31 519 11 81.3 79.4

DBpedia Spotlight(Mendes et al., 2011)

news NE&N 3 58 29 330 7 90.6 90.3

KORE 50(Hoffart et al., 2012)

mixed NE 3 50 13 144 1 88.2 75.7

ERD2014(Carmel et al., 2014)

queries NE 3 91 3.5 59 1 72.9 72.9

Gerdaq(Cornolti et al., 2016)

queries NE&N 3 250 3.6 408 3 70.8 68.6

OKE(Nuzzolese et al., 2015)

wikipedia NE&roles 3 55 26.6 288 3 81.9 81.3

Table 6: Test datasets statistics. Topic has the following possible values: news articles, twitter messages,search queries, or sentences extracted from Wikipedia articles. Type of annotations can be NE (NamedEntities i.e. persons, locations, organizations) and/or N (Nouns i.e. dataset annotates even simple nouns).Exhaustive annotations means that these datasets annotate all entities that appear in their text and thatfall into the selected categories of Type of Annotations. Recall 30/10 shows the percentage of the goldmentions for which their ground truth entity is among the first 30/10 candidate entities returned by ourp(e|m) dictionary.

Page 13: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

F1@MAF1@MI

AID

AA

AID

AB

MSN

BC

OK

E-2

015

OK

E-2

016

N3-

Reu

ters

-128

N3-

RSS

-500

Der

czyn

ski

KO

RE

50

FREME 23.938.3

24.337.0

17.622.6

26.832.7

23.029.0

28.233.5

34.632.4

31.820.4

13.315.5

FOX 54.958.1

58.457.4

26.321.2

56.458.7

50.151.4

53.855.2

37.737.2

43.441.5

30.132.9

Babelfy 42.148.1

43.349.5

40.244.9

42.845.7

40.139.7

29.833.6

36.634.6

32.334.3

53.056.6

Entityclassifier.eu 44.947.0

45.347.6

47.248.9

32.733.0

35.733.6

31.835.3

24.524.1

18.119.3

27.130.1

Kea 38.742.9

42.144.7

35.036.3

53.954.4

54.955.2

22.122.4

25.924.7

33.229.2

46.550.5

DBpedia Spotlight 55.558.3

53.158.8

49.649.2

49.652.2

49.249.8

36.038.0

35.535.1

44.041.5

28.835.4

AIDA 69.072.6

72.373.5

66.068.1

61.765.8

43.347.2

44.748.5

47.347.5

41.534.8

52.958.6

WAT 69.773.3

71.673.6

66.768.9

58.460.7

55.256.6

54.458.3

47.646.5

47.143.6

39.952.2

Best baseline 69.773.3

72.373.6

66.768.9

61.765.8

55.256.6

54.458.3

47.647.5

47.143.6

53.058.6

base model 87.089.5

81.780.8

67.368.8

56.260.1

45.047.0

48.751.4

46.542.8

44.940.1

36.944.7

base model + att 87.189.3

82.482.6

72.672.6

59.063.0

49.251.6

51.054.7

48.444.6

49.344.4

38.045.1

base model + att + global 87.289.8

83.282.8

75.774.7

59.464.6

48.153.1

47.053.2

46.342.7

43.834.9

28.238.3

ED base model + att + global usingStanford NER mentions 76.0

80.573.975.0

74.774.8

65.368.9

57.759.3

54.956.9

48.646.1

50.244.3

41.947.8

Table 7: EL weak matching results on the Gerbil platform. Micro and Macro F1 scores are shown. Wehighlight in red and blue the best and second best models, respectively. Training was done on AIDA-trainset.

F1@MAF1@MI

AC

E20

04

AID

AA

AID

AB

AQ

UA

INT

MSN

BC

DB

pedi

aSpo

tligh

t

Der

czyn

ski

ER

D20

14

GE

RD

AQ

-Dev

GE

RD

AQ

-Tes

t

KO

RE

50

N3-

Reu

ters

-128

N3-

RSS

-500

OK

E-2

015

OK

E-2

016

AGDISTIS 78.165.6

49.255.6

54.254.9

58.960.4

72.974.0

35.339.3

50.844.5

63.232.0

21.417.8

22.820.3

30.033.3

68.563.8

51.451.8

61.462.0

59.658.8

FREME 73.560.3

25.333.0

25.740.3

43.757.6

17.924.3

37.952.5

48.338.2

71.351.8

37.644.2

33.439.6

13.817.4

30.231.7

45.543.9

25.033.3

27.435.0

FOX 38.60.0

55.558.9

59.658.0

0.00.0

16.213.3

11.715.9

51.244.3

51.70.0

11.20.0

10.60.0

26.931.1

56.658.1

52.953.6

54.959.0

50.151.4

Babelfy 76.661.9

66.471.9

70.075.8

70.370.4

72.776.2

51.051.8

64.463.2

69.645.0

30.634.3

31.535.0

70.574.1

54.655.4

62.864.1

65.165.9

60.259.8

Entityclassifier.eu 72.661.9

53.653.7

54.353.9

39.943.1

54.154.6

19.925.5

43.232.6

51.70.0

11.20.0

10.60.0

25.929.9

40.344.4

37.644.9

35.337.6

40.341.4

Kea 87.479.5

62.663.5

69.068.5

79.980.6

80.979.3

75.572.2

63.661.5

87.471.0

56.860.6

59.763.3

50.960.8

62.762.7

58.664.5

68.569.1

67.769.1

DBpedia Spotlight 73.453.9

51.553.2

53.756.1

50.651.8

43.642.1

70.172.6

50.343.3

65.238.4

49.254.4

46.854.0

48.752.3

41.843.4

42.634.6

30.435.8

43.043.1

AIDA 89.579.8

71.474.7

77.877.0

56.657.1

70.374.6

21.424.9

64.563.9

79.163.7

26.523.5

27.225.0

63.269.1

51.256.9

62.965.5

61.663.0

0.00.0

WAT 86.979.6

75.678.5

79.880.5

75.675.4

79.778.8

68.867.1

70.469.5

86.576.0

61.263.9

57.460.7

52.462.2

59.263.1

62.863.9

62.264.9

0.00.0

PBOH 88.083.2

77.380.1

79.080.4

83.384.1

85.586.1

80.580.1

63.664.8

91.080.7

67.068.8

64.467.8

58.663.2

72.169.0

47.655.2

67.467.6

0.00.0

Best baseline 89.583.2

77.380.1

79.880.5

83.384.1

85.586.1

80.580.1

70.469.5

91.080.7

67.068.8

64.467.8

70.574.1

72.169.0

62.965.5

68.569.1

67.769.1

ed base model 89.983.5

86.588.8

82.782.7

78.479.7

82.482.5

71.171.0

63.562.7

81.065.3

54.258.4

52.757.6

45.150.8

62.966.4

64.065.8

69.770.0

71.973.0

ed base model + att 89.584.1

86.088.9

83.683.1

81.282.2

88.486.4

67.872.8

64.365.0

79.965.2

41.849.2

46.654.7

48.355.5

62.567.3

67.268.4

73.774.5

75.176.1

ed base model + att + global 92.085.5

86.689.0

83.883.0

82.383.2

88.586.2

74.478.8

65.365.4

76.961.4

38.646.3

43.253.7

52.460.8

63.467.3

66.668.6

73.274.0

76.778.1

Table 8: ED results on the Gerbil platform. Micro and Macro F1 scores are shown. We highlight in red and bluethe best and second best models, respectively. Training was done on AIDA-train set.

Page 14: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

F1@MAF1@MI

AID

AA

AID

AB

AC

E20

04

AQ

UA

INT

MSN

BC

ed base model 92.193.3

88.486.7

84.485.4

85.986.1

88.289.6

ed base model + att 91.793.5

89.087.2

87.387.0

89.189.4

93.292.3

ed base model + att + global 92.493.7

89.187.3

89.288.1

89.489.7

93.392.5

Table 9: ED results matched locally (not on Gerbil) using KB entity IDs. Micro and Macro F1 scoresare shown. For the ED task there is a significant discrepancy between the local and the Gerbil scoreswhich likely has two main causes: a different URL-based matching scheme used in Gerbil3and parsing oralignment errors.

Rec@MARec@MI

AC

E20

04

AQ

UA

INT

FREME 44.830.8

23.723.7

FOX 0.00.0

0.00.0

Babelfy 10.017.8

34.935.8

Entityclassifier.eu 35.759.7

28.429.2

Kea 23.740.3

36.135.9

DBpedia Spotlight 39.160.5

45.145.2

AIDA 28.048.6

36.036.5

WAT 26.844.7

35.135.5

Best baseline 44.860.5

45.145.2

model+att+global 40.368.3

39.440.4

Table 10: El strong matching results on the Gerbil platform. Micro and Macro recall scores are shown forACE2004 and AQUAINT datasets that only annotate each entity once, thus F1 scores not being useful.We highlight in red and blue the best and second best models, respectively. Our models were trained onAIDA as before.

Page 15: End-to-End Neural Entity Linking - arXivEnd-to-End Neural Entity Linking Nikolaos Kolitsas ETH Zürich nikos_kolitsas@hotmail.com Octavian-Eugen Ganea ETH Zürich octavian.ganea@inf.ethz.ch

F1@MAF1@MI

GE

RD

AQ

-Dev

GE

RD

AQ

-Tes

t

DB

pedi

aSpo

tligh

t

FREME 11.20.0

10.60.0

11.613.0

FOX 11.20.0

10.60.0

11.315.3

Babelfy 19.615.5

19.617.5

8.713.1

Entityclassifier.eu 11.20.0

10.60.0

18.323.1

Kea 40.240.2

40.243.7

42.042.6

DBpedia Spotlight 29.627.8

29.431.4

31.837.6

AIDA 11.20.0

10.60.0

14.419.1

WAT 12.93.3

12.66.0

14.517.0

Best baseline 40.240.2

40.243.7

42.042.6

model+att+global on gerdaq train 42.744.9

40.041.2

33.434.8

Table 11: El strong matching results on the Gerbil platform. Micro and Macro F1 scores are shown.Training was done on Gerdaq train dataset. We highlight in red and blue the best and second best models,respectively.

F1@MAF1@MI

Mic

ropD

ev

Mic

ropT

est

FREME 3.93.8

77.93.7

FOX 6.17.5

71.93.7

Babelfy 6.29.5

52.52.8

Entityclassifier.eu 23.529.1

9.22.4

Kea 37.338.8

51.66.5

DBpedia Spotlight 29.134.0

13.65.2

AIDA 6.49.7

73.35.0

WAT 0.00.0

0.00.0

Best baseline 37.338.8

77.96.5

model on micropost train 57.363.7

17.410.0

Table 12: El strong matching results on the Gerbil platform when training on Micropost-train 2016. Microand Macro F1 scores are shown. The majority of the errors are due to the newer KB version used in thisdataset. We highlight in red and blue the best and second best models, respectively.