

GERBIL – General Entity Annotator Benchmarking Framework

Ricardo Usbeck* (Leipzig University, IFI/AKSW; Unister GmbH, Leipzig) – usbeck@informatik.uni-leipzig.de

Michael Röder* (Leipzig University, IFI/AKSW; Unister GmbH, Leipzig) – roeder@informatik.uni-leipzig.de

Axel-Cyrille Ngonga Ngomo* (Leipzig University, IFI/AKSW, Leipzig, Germany) – [email protected]

ABSTRACT
The need to bridge between the unstructured data on the Document Web and the structured data on the Web of Data has led to the development of a considerable number of annotation tools. However, these tools are currently still hard to compare since the published evaluation results are calculated on diverse datasets and evaluated based on different measures. We present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights pertaining to the extension, integration and use of annotation applications. In particular, GERBIL provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in a machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Finally, the tool diagnostics provided by GERBIL allow deriving insights pertaining to the areas in which tools should be further refined, thus allowing developers to create an informed agenda for extensions and end users to detect the right tools for their purposes. GERBIL aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable objective evaluation results.

∗Additional authors are listed at the end of the article.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2015, May 18–22, 2015, Florence, Italy.
ACM 978-1-4503-3469-3/15/05.
http://dx.doi.org/10.1145/2736277.2741626.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text analysis

Keywords
Semantic Entity Annotation System; Reusability; Archivability; Benchmarking Framework

1. INTRODUCTION
The implementation of the original vision behind the Semantic Web demands the development of approaches and frameworks for the seamless extraction of structured data from text. While manifold annotation tools have been developed over the last years to address (some of) the subtasks related to the extraction of structured data from unstructured data [13, 17, 24, 25, 27, 32, 35, 41, 44], the provision of comparable results for these tools remains a tedious problem. The issue of comparability of results is not to be regarded as being intrinsic to the annotation task. Indeed, it is now well established that scientists spend between 60 and 80% of their time preparing data for experiments [14, 18, 31]. That data preparation is such a tedious problem in the annotation domain is mostly due to the different formats of the gold standards as well as the different data representations across reference datasets. These restrictions have led to authors evaluating their approaches on datasets (1) that are available to them and (2) for which writing a parser as well as an evaluation tool can be carried out with reasonable effort. In addition, a large number of quality measures have been developed and used actively across the annotation research community to evaluate the same task, leading to the results across publications on the same topics not being easily comparable. For example, while some authors publish macro-F-measures and simply call them F-measures, others publish micro-F-measures for the same purpose, leading to significant discrepancies across the scores. The same holds for the evaluation of how well entities match: both partial matches and complete matches have been used in previous evaluations of annotation tools [7, 39]. This heterogeneous landscape of tools, datasets and measures leads to a poor repeatability of experiments, which makes the evaluation of the real performance of novel approaches against the state of the art rather difficult.

The insights above have led to a movement towards the creation of frameworks to ease the evaluation of solutions that address the same annotation problem [5, 7]. In this paper, we present GERBIL – a general entity annotator benchmark –, a community-driven effort to enable the continuous evaluation of annotation tools. GERBIL is an open-source and extensible framework that allows evaluating tools against (currently) 9 different annotators on 11 different datasets within 6 different experiment types. By integrating such a large number of datasets, experiment types and frameworks, GERBIL allows users to evaluate their tools against other semantic entity annotation systems (short: entity annotation systems) by using exactly the same setting, leading to fair comparisons based on exactly the same measures. While the evaluation core of GERBIL is based on the BAT-framework [7], our approach goes beyond the state of the art in several respects:

• GERBIL provides persistent URLs for experimental settings. Hence, by using GERBIL for experiments, tool developers can ensure that the settings for their experiments (measures, datasets, versions of the reference frameworks, etc.) can be reconstructed in a unique manner in future works.

• Through experiment URLs, GERBIL also addresses the problem of archiving experimental results and allows end users to gather all pieces of information required to choose annotation frameworks for practical applications.

• GERBIL aims to be a central repository for annotation results without being a central point of failure: While we make experiment URLs available, we also provide users directly with their results to ensure that they can use them locally without having to rely on GERBIL.

• The results of GERBIL are published in a machine-readable format. In particular, our use of DataID [2] and DataCube [9] to denote tools and datasets ensures that results can be easily combined and queried (for example to study the evolution of the performance of frameworks) while the exact configuration of the experiments remains uniquely reconstructable. By these means, we also tackle the problem of reproducibility.

• Through the provision of results on different datasets of different types and the provision of results on a simple user interface, GERBIL also provides means to quickly gain an overview of the current performance of annotation tools, thus providing (1) developers with insights pertaining to the type of data on which their accuracy needs improvement and (2) end users with insights allowing them to choose the right tool for the tasks at hand.

• With GERBIL we introduce the notion of knowledge-base-agnostic benchmarking of entity annotation systems through generalized experiment types. By these means, we allow benchmarking tools against reference datasets from any domain grounded in any reference knowledge base.

To ensure that the GERBIL framework is useful to both end users and tool developers, its architecture and interface were designed with the following principles in mind:

• Easy integration of annotators: We provide a wrapping interface that allows annotators to be evaluated via their REST interface. In particular, we integrated 6 additional annotators not evaluated against each other in previous works (e.g., [7]).

• Easy integration of datasets: We also provide means to gather datasets for evaluation directly from data services such as DataHub.1 In particular, we added 6 new datasets to GERBIL.

• Easy addition of new measures: The evaluation measures used by GERBIL are implemented as interfaces. Thus, the framework can be easily extended with novel measures devised by the annotation community.

• Extensibility: GERBIL is provided as an open-source platform2 that can be extended by members of the community both to new tasks and different purposes.

• Diagnostics: The interface of the tool was designed to provide developers with means to easily detect aspects in which their tool(s) need(s) to be improved.

• Portability of results: We generate human- and machine-readable results to ensure maximum usefulness and portability of the results generated by our framework.

In the rest of this paper, we present and evaluate GERBIL. We begin by giving an overview of related work. Thereafter, we present the GERBIL framework. We focus in particular on how annotators and datasets can be added to GERBIL and give a short overview of the annotators and tools that are currently included in the framework. We then present an evaluation of the framework that aims to quantify the effort necessary to include novel annotators and datasets to the framework. We conclude with a discussion of the current state of GERBIL and a presentation of future work. More information can be found at our project webpage http://gerbil.aksw.org and at the code repository page https://github.com/AKSW/gerbil. The online version of GERBIL can be accessed at http://gerbil.aksw.org/gerbil.

2. RELATED WORK
Named Entity Recognition and Entity Linking have gained significant momentum with the growth of Linked Data and structured knowledge bases. Over the last few years, the problem of result comparability has thus led to the development of a handful of frameworks.

The BAT-framework [7] is designed to facilitate the benchmarking of named entity recognition (NER), named entity disambiguation (NED) – also known as linking (NEL) – and concept tagging approaches. BAT compares seven existing entity annotation approaches using Wikipedia as reference. Moreover, it defines six different task types, five different matchings and six evaluation measures, providing five datasets. Rizzo et al. [35] present a state-of-the-art study of NER and NEL systems for annotating newswire and micropost documents using well-known benchmark datasets, namely CoNLL2003 and Microposts 2013 for NER as well as AIDA/CoNLL and Microposts2014 [3] for NED. The authors propose a common schema, named the NERD ontology,3 to align the different taxonomies used by various extractors. To tackle the disambiguation ambiguity, they propose a method to identify the closest DBpedia resource by (exact-)matching the entity mention.

1 http://datahub.io
2 Available at http://gerbil.aksw.org.

Over the course of the last 25 years, several challenges, workshops and conferences have dedicated themselves to the comparable evaluation of information extraction (IE) systems. Starting in 1993, the Message Understanding Conference (MUC) introduced a first systematic comparison of information extraction approaches [42]. Ten years later, the Conference on Computational Natural Language Learning (CoNLL) started to offer a shared task on named entity recognition and published the CoNLL corpus [43]. In addition, the Automatic Content Extraction (ACE) challenge [10], organized by NIST, evaluated several approaches but was discontinued in 2008. Since 2009, the Text Analysis Conference hosts the workshop on knowledge base population (TAC-KBP) [22], where mainly linguistic-based approaches are published. The Senseval challenge, originally concerned with classical NLP disciplines, widened its focus in 2007 and changed its name to SemEval to account for the recently recognized impact of semantic technologies [19]. The Making Sense of Microposts workshop series (#Microposts) established an entity recognition challenge in 2013 and an entity linking challenge in 2014, thereby focusing on tweets and microposts [37]. In 2014, Carmel et al. [5] introduced one of the first Web-based evaluation systems for NER and NED and the centerpiece of the entity recognition and disambiguation (ERD) challenge. Here, all frameworks are evaluated against the same unseen dataset and provided with corresponding results.

GERBIL goes beyond the state of the art by extending the BAT-framework as well as [35] in several dimensions to enhance reproducibility, diagnostics and publishability of entity annotation systems. In particular, we provide 6 additional datasets and 6 additional annotators. The framework addresses the lack of treatment of NIL values within the BAT-framework and provides more wrapping approaches for annotators and datasets. Moreover, GERBIL provides persistent URLs for experiment results, unique URIs for frameworks and datasets, a machine-readable output and automatic dataset updates from data portals. Thus, it allows for a holistic comparison of existing annotators while simplifying the archiving of experimental results. Moreover, our framework offers opportunities for the fast and simple evaluation of entity annotation system prototypes via novel NIF-based [15] interfaces, which are designed to simplify the exchange of data and binding of services.

3. THE GERBIL FRAMEWORK

3.1 Architecture Overview
GERBIL abides by a service-oriented architecture driven by the model-view-controller pattern (see Figure 1). Entity annotation systems, datasets and configurations like experiment type, matching or measure are implemented as controller interfaces easily pluggable to the core controller. The output of experiments as well as descriptions of the various components are stored in a serverless database for fast deployment. Finally, the view component displays configuration options and renders experiment results delivered by the main controller, which communicates with the diverse interfaces and the database.

3 http://nerd.eurecom.fr/ontology

3.2 Features
Experiments run in our framework can be configured in several manners. In the following, we present some of the most important parameters of experiments available in GERBIL.

3.2.1 Experiment types
An experiment type defines the way used to solve a certain problem when extracting information. Cornolti et al.'s [7] BAT-framework offers six different experiment types, namely (scored) annotation (S/A2KB), disambiguation (D2KB) – also known as linking – and (scored respectively ranked) concept annotation (S/R/C2KB) of texts. In [35], the authors propose two types of experiments, focusing on highlighting the strengths and weaknesses of the analyzed systems: i) entity recognition, i.e., the detection of the exact match of the pair entity mention and type (e.g., detecting the mention Barack Obama and typing it as a Person), and ii) entity linking, where an exact match of the mention is given and the associated DBpedia URI has to be linked (e.g., locating a resource in DBpedia which describes the mention Barack Obama). This work differs from the previous one by experimenting with entity recognition and by annotating entities to an RDF knowledge base.

GERBIL reuses the six experiment types provided by the BAT-framework and extends them by the idea to not only link to Wikipedia but to any knowledge base K. One major formal update of the measures in GERBIL is that, in addition to implementing experiment types from previous frameworks, it also measures the influence of NIL annotations, i.e., the linking of entities that are recognized as such but cannot be linked to any resource from the reference knowledge base K. For example, the string Ricardo Usbeck can be recognized as a person name by several tools but cannot be linked to Wikipedia/DBpedia, as Ricardo does not have a URI in these reference datasets. Our framework extends the experiment types of [7] as follows: Let m = (s, l, d, c) ∈ M denote an entity mention in document d ∈ D with start position s, length l and confidence score c ∈ [0, 1]. Note that some frameworks might not return (1) a position s or a length l for a mention, in which case we set s = 0 and l = 0; (2) a score c, in which case we set c = 1.

We implement six types of experiments:

1. D2KB: The goal of this experiment type is to map a set of given entity mentions (i.e., a subset µ ⊆ M) to entities from a given knowledge base or to NIL. Formally, this is equivalent to finding a mapping a : µ → K ∪ {NIL}. In the classical setting for this task, the start position, the length and the score of the mentions mi are not taken into consideration.

2. A2KB: This task is the classical NER/D task and thus an extension of the D2KB task. Here, two functions are to be found. First, the entity mentions need to be extracted from a document set D. To this end, an extraction function ex : D → 2^M must be computed. The aim of the second step is then to match the results of ex to entities from K ∪ {NIL} by devising a function a as in the D2KB task.

Figure 1: Overview of GERBIL's abstract architecture. Interfaces to users and providers of datasets and annotators are marked in blue.

3. Sa2KB: Sa2KB is an extension of A2KB where the scores ci ∈ [0, 1] of the mentions detected by the approach are taken into consideration. These scores are then used during the evaluation.

4. C2KB: The concept tagging task C2KB aims to detect entities when given a document. Formally, the tagging function tag simply returns a subset of K for each input document d.

5. Sc2KB: This task is an extension of C2KB where the tagging function returns a subset of K × [0, 1] for each input document d.

6. Rc2KB: In this particular extension of C2KB, the tagging function returns a sorted list of resources from K, i.e., an element of K*, where K* = ⋃_{i=0}^{∞} K^i.
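To make the formal model above concrete, the following minimal Java sketch models a mention m = (s, l, d, c) and the signatures of the D2KB, A2KB and C2KB tasks. All type and method names are hypothetical illustrations of the formalisation, not GERBIL's actual adapter API.

```java
import java.util.Set;

// A mention m = (s, l, d, c): start position, length, containing document and
// confidence score. Missing positions default to s = 0 and l = 0; a missing
// score defaults to c = 1, as in the conventions stated above.
record Mention(int start, int length, String documentId, double confidence) {
    Mention(String documentId) {
        this(0, 0, documentId, 1.0);
    }
}

// An entity is identified by a URI of the reference knowledge base K;
// a null URI stands for NIL.
record Entity(String uri) {
    static final Entity NIL = new Entity(null);
}

// D2KB: map each given mention to an entity from K or to NIL, i.e. a : µ → K ∪ {NIL}.
interface D2KBAnnotator {
    Entity link(Mention mention, String documentText);
}

// A2KB: additionally extract the mentions themselves, i.e. ex : D → 2^M,
// before linking them as in D2KB.
interface A2KBAnnotator extends D2KBAnnotator {
    Set<Mention> extract(String documentText);
}

// C2KB: concept tagging simply returns a subset of K for each input document.
interface C2KBAnnotator {
    Set<Entity> tag(String documentText);
}
```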

With this extension, our framework can now deal with gold standard datasets and annotators that link to any knowledge base, e.g., DBpedia, BabelNet [30] etc., as long as the necessary identifiers are URIs. We were thus able to implement 6 new gold standard datasets usable for each experiment type (cf. Section 3.3) and 6 new annotators linking entities to any knowledge base instead of solely Wikipedia as in previous works (cf. Section 3.2.4). With this extensible interface, GERBIL can be extended to deal with supplementary experiment types, e.g., entity salience [7], entity detection [39], typing [35], word sense disambiguation (WSD) [27] and relation extraction [39]. These categories of experiment types will be added to GERBIL in future versions.

3.2.2 Matching
A matching defines which conditions the result of an annotator has to fulfill to be a correct result. In case of existing redirections, we assume an implicit matching function to account for the many-to-one relation [7]. The first matching type M used for the C2KB, Rc2KB and Sc2KB experiments is the strong entity matching. Here, each mention is mapped to an entity of the knowledge base K via a matching function f with f(m) ∈ K ∪ {NIL}. Following this matching, a single entity mention m = (s, l, d, c) returned by the annotator is correct iff it matches exactly with one of the entity mentions m′ = (s′, l′, d, c′) in the gold standard G(d) of d [7]. Formally,

    M(m, G) = 1 iff ∃ m′ ∈ G : f(m) = f(m′), and 0 else.    (1)

For the D2KB experiments, the matching is expanded to the strong annotation matching and includes the correct position of the entity mention inside the document:

    Me(m, G) = 1 iff ∃ m′ ∈ G : f(m) = f(m′) ∧ s = s′ ∧ l = l′, and 0 else.    (2)

The strong annotation matching can be used for A2KB and Sa2KB experiments, too. However, in practice this exact matching can be misleading. A document can contain a gold standard named entity like "President Barack Obama" while the result of an annotator only marks "Barack Obama" as named entity. Using an exact matching leads to weighting this result as wrong while a human might rate it as correct. Therefore, the weak annotation matching relaxes the conditions of the strong annotation matching: a correct annotation has to be linked to the same entity and must overlap the annotation of the gold standard:

    Mw(m, G) = 1 iff ∃ m′ ∈ G : f(m) = f(m′) ∧
               ((s ≤ s′ ∧ (s + l) ≤ (s′ + l′)) ∨ (s ≥ s′ ∧ (s + l) ≥ (s′ + l′)) ∨
                (s ≤ s′ ∧ (s + l) ≥ (s′ + l′)) ∨ (s ≥ s′ ∧ (s + l) ≤ (s′ + l′))),
               and 0 else.    (3)
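The following sketch, reusing the hypothetical Mention and Entity records from the previous listing, spells out the two annotation matchings: a strong match requires an identical span, while Equation (3) enumerates the relative placements of the two spans, which the code implements directly as the intended overlap condition. This is an illustration only, not GERBIL's matching implementation.

```java
// A minimal sketch of the matching predicates from Eq. (2) and Eq. (3).
final class Matching {

    // Strong annotation matching (Eq. 2): same linked entity and identical span.
    static boolean strongMatch(Mention m, Entity e, Mention gold, Entity goldEntity) {
        return sameEntity(e, goldEntity)
                && m.start() == gold.start()
                && m.length() == gold.length();
    }

    // Weak annotation matching (Eq. 3): same linked entity and overlapping spans.
    static boolean weakMatch(Mention m, Entity e, Mention gold, Entity goldEntity) {
        int end = m.start() + m.length();
        int goldEnd = gold.start() + gold.length();
        boolean overlaps = m.start() <= goldEnd && gold.start() <= end;
        return sameEntity(e, goldEntity) && overlaps;
    }

    // f(m) = f(m'): here simply URI equality; resolving knowledge-base redirects
    // (the implicit matching function mentioned above) is omitted in this sketch.
    static boolean sameEntity(Entity a, Entity b) {
        return a.uri() == null ? b.uri() == null : a.uri().equals(b.uri());
    }
}
```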

3.2.3 Measures
Currently, GERBIL offers six measures subdivided into two groups and derived from the BAT-framework, namely the micro- and the macro-group of precision, recall and F-measure. At the moment, those measures ignore NIL annotations, i.e., if a gold standard dataset contains entities that are not contained in the target knowledge base K and an annotator detects the entity and links it to any URI, emerging novel URI or NIL, this will always result in a false-positive evaluation. To alleviate this problem, GERBIL allows adding additional measures to evaluate the results of annotators regarding the heterogeneous landscape of gold standard datasets.
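To illustrate why the micro/macro distinction discussed in Section 1 matters, the sketch below computes both variants of the F-measure from per-document counts of true positives, false positives and false negatives. The class and record names are hypothetical, and the convention of returning 1.0 for empty denominators is an assumption, not necessarily the one GERBIL uses.

```java
import java.util.List;

final class PRF {
    record Counts(int tp, int fp, int fn) {}

    // Convention assumed here: an empty denominator yields 1.0.
    static double precision(int tp, int fp) { return tp + fp == 0 ? 1.0 : (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return tp + fn == 0 ? 1.0 : (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return p + r == 0 ? 0.0 : 2 * p * r / (p + r); }

    // Micro: sum the counts over all documents, then compute P, R and F1 once.
    static double microF1(List<Counts> docs) {
        int tp = 0, fp = 0, fn = 0;
        for (Counts c : docs) { tp += c.tp(); fp += c.fp(); fn += c.fn(); }
        return f1(precision(tp, fp), recall(tp, fn));
    }

    // Macro: compute F1 per document, then average the per-document scores.
    static double macroF1(List<Counts> docs) {
        double sum = 0;
        for (Counts c : docs) sum += f1(precision(c.tp(), c.fp()), recall(c.tp(), c.fn()));
        return docs.isEmpty() ? 0.0 : sum / docs.size();
    }
}
```

A corpus with one large and many small documents can yield very different micro and macro scores for the same annotator, which is why reporting both (and naming them) matters.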

3.2.4 Annotators
GERBIL aims to reduce the amount of work required to compare existing as well as novel annotators in a comprehensive and reproducible way. To this end, we provide two main approaches to evaluating entity annotation systems with GERBIL.

1. BAT-framework adapter: Within BAT, annotators can be implemented by wrapping them using a Java-based interface. Since GERBIL is based on the BAT-framework, annotators of this framework can be added to GERBIL easily. Due to the community effort behind GERBIL, we could raise the number of published annotators from 5 to 9. We investigate the effort of implementing a BAT-framework adapter in contrast to evaluation efforts done without a structured evaluation framework in Section 4.

2. NIF-based services: GERBIL implements means to understand NIF-based [15] communication over webservices in two ways. First, if the server-side implementation of an annotator understands NIF documents as input and output format, GERBIL and the framework can simply exchange NIF documents.4 Thus, novel NIF-based annotators can be deployed into GERBIL efficiently and use a more robust communication format, compared to the amount of work necessary for deploying and writing a BAT-framework adapter. Second, if developers do not want to publish their APIs or write source code, GERBIL offers the possibility for NIF-based webservices to be tested online by providing their URI and name only.5 GERBIL does not store these connections in terms of API keys or URLs but still offers the opportunity of persistent experiment results.

4 We describe the exact requirements on the structure of the NIF document on our project website's wiki, as NIF offers several ways to build a NIF-based document or corpus.
5 http://gerbil.aksw.org/gerbil/config
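To give an impression of the NIF-based exchange, the hedged sketch below builds a minimal NIF 2.0 context in Turtle and POSTs it to an annotator endpoint using only the JDK's HTTP client. The endpoint URL is a placeholder, and the exact document structure GERBIL requires is the one documented on the project wiki, so this is an assumption-laden illustration rather than the definitive format.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NifClientSketch {
    public static void main(String[] args) throws Exception {
        // Minimal NIF 2.0 context: the document text plus its character offsets.
        String nif = """
            @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
            @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
            <http://example.org/doc1#char=0,35>
                a nif:Context, nif:RFC5147String ;
                nif:isString "Barack Obama visited Leipzig today." ;
                nif:beginIndex "0"^^xsd:nonNegativeInteger ;
                nif:endIndex "35"^^xsd:nonNegativeInteger .
            """;

        // Hypothetical annotator endpoint; a NIF-based service is expected to
        // answer with a NIF document enriched by the annotations it found.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://example.org/my-annotator/nif"))
                .header("Content-Type", "text/turtle")
                .POST(HttpRequest.BodyPublishers.ofString(nif))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```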

Currently, GERBIL offers 9 entity annotation systems with a variety of features, capabilities and experiments. In the following, we present current state-of-the-art approaches, both available and unavailable in GERBIL.

1. Cucerzan: As early as 2007, Cucerzan presented an NED approach based on Wikipedia [8]. The approach tries to maximize the agreement between the contextual information of the input text and a Wikipedia page as well as category tags on the Wikipedia pages. The test data is still available6 but, since we can safely assume that the Wikipedia page content has changed a lot since 2006, we do not use it in our framework, nor are we aware of any publication reusing this data. Furthermore, we were not able to find a running webservice or source code for this approach.

2. Wikipedia Miner: This approach was introduced in [25] in 2008 and is based on different facts like prior probabilities, context relatedness and quality, which are then combined and tuned using a classifier. The authors evaluated their approach on a subset of the AQUAINT dataset.7 They provide the source code for their approach as well as a webservice,8 which is available in GERBIL.

3. Illinois Wikifier: In 2011, [34] presented an NED approach for entities from Wikipedia. In this article, the authors compare local approaches, e.g., using string similarity, with global approaches, which use context information and finally lead to better results. The authors provide their datasets9 as well as their software "Illinois Wikifier"10 online. Since "Illinois Wikifier" is currently only available as a local binary and GERBIL is solely based on webservices, we excluded it from GERBIL for the sake of comparability and server load.

4. DBpedia Spotlight: One of the first semantic approaches [24] was published in 2011. This framework combines an NER and NED approach based upon DBpedia.11 Based on a vector-space representation of entities and using the cosine similarity, this approach has a public (NIF-based) webservice12 as well as an online available evaluation dataset.13

6 http://research.microsoft.com/en-us/um/people/silviu/WebAssistant/TestData/
7 http://www.nist.gov/tac/data/data_desc.html#AQUAINT
8 http://wikipedia-miner.cms.waikato.ac.nz/
9 http://cogcomp.cs.illinois.edu/page/resource_view/4
10 http://cogcomp.cs.illinois.edu/page/software_view/33
11 https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Known-uses
12 https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service
13 http://wiki.dbpedia.org/spotlight/isemantics2011/evaluation

5. TagMe 2: TagMe 2 [13] was published in 2012 and is based on a directory of links, pages and an in-link graph from Wikipedia. The approach recognizes named entities by matching terms with Wikipedia link texts and disambiguates the match using the in-link graph and the page dataset. Afterwards, TagMe 2 prunes the identified named entities which are considered as non-coherent to the rest of the named entities in the input text. The authors publish a key-protected webservice14 as well as their datasets15 online. The source code, licensed under the Apache 2 licence, can be obtained directly from the authors. The datasets comprise only fragments of 30 words and less of full documents and will not be part of the current version of GERBIL.

6. AIDA: The AIDA approach [17] relies on coherence graph building and dense subgraph algorithms and is based on the YAGO216 knowledge base. Although the authors provide their source code, a webservice and their dataset, which is a manually annotated subset of the 2003 CoNLL shared task [43], GERBIL will not use the webservice since it is not stable enough for regular replication purposes at the moment of this publication.17 That is, the AIDA team discourages its use because they constantly switch the underlying entity repository and tune parameters.

7. NERD-ML: In 2013, [11] proposed an approach for entity recognition tailored for extracting entities from tweets. The approach relies on a machine learning classification of the entity type given a rich feature vector composed of a set of linguistic features, the output of a properly trained Conditional Random Fields classifier and the output of a set of off-the-shelf NER extractors supported by the NERD Framework. The follow-up, NERD-ML [35], improved the classification task by redesigning the selection of the features. The authors assessed NERD-ML's performance on both the micropost and newswire domains. NERD-ML has a public webservice which is part of GERBIL.18

8. KEA NER/NED: This approach is the successor of the approach introduced in [41], which is based on a fine-granular context model taking into account heterogeneous text sources as well as text created by automated multimedia analysis. The source texts can have different levels of accuracy, completeness, granularity and reliability which influence the determination of the current context. Ambiguity is resolved by selecting entity candidates with the highest level of probability according to the predetermined context. The new implementation begins with the detection of groups of consecutive words (n-gram analysis) and a lookup of all potential DBpedia candidate entities for each n-gram. The disambiguation of candidate entities is based on a scoring cascade. KEA is available as a NIF-based webservice.19

9. WAT: WAT is the successor of TagMe [13].20 The new annotator includes a re-design of all TagMe components, namely the spotter, the disambiguator, and the pruner. Two disambiguation families were newly introduced: graph-based algorithms for collective entity linking and vote-based algorithms for local entity disambiguation (based on the work of Ferragina et al. [13]). The spotter and the pruner can be tuned using SVM linear models. Additionally, the library can be used as a D2KB-only system by feeding appropriate mention spans to the system.

14 http://tagme.di.unipi.it/
15 http://acube.di.unipi.it/tagme-dataset/
16 http://www.mpi-inf.mpg.de/yago-naga/yago/
17 https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/
18 http://nerd.eurecom.fr
19 http://s16a.org/kea
20 http://github.com/nopper/wat

10. AGDISTIS: This approach [44] is a pure entity disambiguation approach (D2KB) based on string similarity measures, an expansion heuristic for labels to cope with co-referencing and the graph-based HITS algorithm. The authors published datasets21 along with their source code and an API.22 AGDISTIS can only be used for the D2KB task.

11. Babelfy: The core of this approach draws on the use of random walks and a densest subgraph algorithm to tackle the word sense disambiguation and entity linking tasks jointly in a multilingual setting [27], thanks to the BabelNet23 semantic network [30]. Babelfy has been evaluated using six datasets: three from earlier SemEval tasks [33, 29, 28], one from a Senseval task [38] and two already used for evaluating AIDA [17, 16]. All of them are available online but distributed throughout the Web. Additionally, the authors offer a webservice that is limited to 100 requests per day, which can be extended for research purposes [26].24

12. Dexter: This approach [6] is an open-source implementation of an entity disambiguation framework. The system was implemented in order to simplify the implementation of an entity linking approach and allows replacing single parts of the process. The authors implemented several state-of-the-art disambiguation methods. Results in this paper are obtained using an implementation of the original TagMe disambiguation function. Moreover, Ceccarelli et al. provide the source code25 as well as a webservice.

Table 1 compares the annotation systems implemented in GERBIL and in the BAT-Framework. While AGDISTIS was added to the source code of the BAT-Framework by a third party after the publication of Cornolti et al.'s initial work [7] in 2014, GERBIL's community effort led to the implementation of overall 6 new annotators as well as the aforementioned generic NIF-based annotator. The AIDA annotator as well as the "Illinois Wikifier" will not be available in GERBIL since we restrict ourselves to webservices. However, these algorithms can be integrated at any time as soon as their webservices are available.

3.3 Datasets
Table 2 shows the heterogeneity of the datasets used for prior evaluations, while Table 3 presents an overview of the datasets that were used to evaluate some well-known entity annotators in previous works. These tables make clear that the number and types of used datasets vary a lot, thus preventing a fast comparison of annotation systems.

21 https://github.com/AKSW/n3-collection
22 https://github.com/AKSW/AGDISTIS
23 http://babelnet.org
24 http://babelfy.org
25 http://dexter.isti.cnr.it


Table 3: Comparison of annotators and datasets with indication whether software or datasets respectively webservices are available for reproduction. ∗ indicates that only a subset has been used to evaluate this annotator. ∗∗ indicates that the webservice is not meant to be used within scientific evaluations due to unstable backends. (Rows: Cucerzan 2007, Wikipedia Miner 2008, Illinois Wikifier 2011, Spotlight 2011, AIDA 2011, TagMe 2 2012, Dexter 2013, KEA 2013, WAT 2013, AGDISTIS 2014, Babelfy 2014, NERD-ML 2014, BAT-Framework 2013, NERD Framework 2014, GERBIL 2014. Columns: year; the datasets ACE, Wiki, AQUAINT, MSNBC, IITB, Meij, AIDA/CoNLL, N3 collection, KORE 50, Wiki-Disamb30, Wiki-Annot30, Spotlight Corpus, SemEval-2013 task 12, SemEval-2007 task 7, SemEval-2007 task 17, Senseval-3, NIF-based corpus, Microposts2014; software available?; webservice available?)

BAT allows evaluating the performance of different approaches using five datasets, namely AQUAINT, MSNBC, IITB, Meij and AIDA/CoNLL. With GERBIL, we activate one more dataset already implemented by the authors, namely ACE2004 from Ratinov et al. [34]. Furthermore, we implemented a dataset wrapper for the Microposts2014 corpus which has been used to evaluate NERD-ML [35]. The dataset itself was introduced in 2014 [3] and consists of 3500 tweets especially related to event data. Moreover, we capitalize upon the uptake of publicly available, NIF-based corpora over the last years [40, 36].26 To this end, GERBIL implements a Java-based NIF [15] reader and writer module which enables loading arbitrary NIF document collections, as well as the communication with NIF-based webservices. Additionally, we integrated four NIF corpora, i.e., the RSS-500 and Reuters-128 datasets,27 as well as the Spotlight Corpus and the KORE 50 dataset.28

The extensibility of the datasets in GERBIL is furthermore ensured by allowing users to upload or use already available NIF datasets from DataHub. However, GERBIL is currently only importing already available datasets. GERBIL will regularly check whether new corpora are available and publish them for benchmarking after a manual quality assurance cycle which ensures their usability for the implemented configuration options. Additionally, users can upload their NIF corpora directly to GERBIL, avoiding their publication in publicly available sources. This option allows for rapid testing of entity annotation systems with closed-source or licenced datasets.

26 http://datahub.io/dataset?license_id=cc-by&q=NIF
27 https://github.com/AKSW/n3-collection
28 http://www.yovisto.com/labs/ner-benchmarks/

Some of the datasets shown in Table 3 are not yet implemented, either due to size and server load limitations, i.e., Wiki-Disamb30 and Wiki-Annot30, or due to their original experiment type. In particular, the Senseval-3 as well as the different SemEval datasets demand word sense disambiguation as experiment type and thereby linking to BabelNet [30] or WordNet [12], which is not yet covered in GERBIL. Still, GERBIL currently offers 11 state-of-the-art datasets ranging from newswire and Twitter to encyclopedic corpora with various amounts of texts and entities. Due to license issues we are only able to provide direct downloads for 9 of them, but we provide instructions to obtain the others on our project wiki.

Table 4 depicts the features of the current datasets available in GERBIL. These provide a broad evaluation ground, leveraging the possibility for sophisticated tool diagnostics.

3.4 Output
GERBIL's main aim is to provide comprehensive, reproducible and publishable experiment results. Hence, GERBIL's experimental output is represented as a table containing the results, as well as embedded JSON-LD29 RDF data using the RDF DataCube vocabulary [9]. We ensure a detailed description of each component of an experiment as well as machine-readable, interlinkable results following the 5-star Linked Data principles. Moreover, we provide a persistent and time-stamped URL for each experiment, see Table 5.

29 http://www.w3.org/TR/json-ld/


Corpus             Topic      |Documents|  Avg. Entity/Doc.
ACE2004            news       57           4.44
AIDA/CoNLL         news       1393         19.97
AQUAINT            news       50           14.54
IITB               mixed      103          109.22
KORE 50            mixed      50           2.86
Meij               tweets     502          1.62
Microposts2014     tweets     3505         0.65
MSNBC              news       20           32.50
N3 Reuters-128     news       128          4.85
N3 RSS-500         RSS-feeds  500          0.99
Spotlight Corpus   news       58           5.69

Table 4: Features of the datasets and their documents.

Annotator               BAT-Framework  GERBIL  Experiment
[25] Wikipedia Miner    ✓              ✓       SA2KB
[34] Illinois Wikifier  ✓              (✓)     SA2KB
[24] Spotlight          ✓              ✓       SA2KB
[13] TagMe 2            ✓              ✓       SA2KB
[17] AIDA               ✓              (✓)     SA2KB
[41] KEA                               ✓       SA2KB
[32] WAT                               ✓       SA2KB
[44] AGDISTIS           (✓)            ✓       D2KB
[27] Babelfy                           ✓       SA2KB
[35] NERD-ML                           ✓       SA2KB
[6] Dexter                             ✓       SA2KB
NIF-based Annotator                    ✓       any

Table 1: Overview of implemented annotator systems. Brackets indicate the existence of the implementation of the adapter but also the inability to use it in the live system.

RDF DataCube is a vocabulary standard and can be used to represent fine-grained, multidimensional statistical data; it is compatible with the Linked SDMX [4] standard. Every GERBIL experiment is modelled as a qb:Dataset containing the individual runs of the annotators on specific corpora as qb:Observations. Each observation features the qb:Dimensions experiment type, matching type, annotator, corpus, and time. The six evaluation measures offered by GERBIL as well as the error count are expressed as qb:Measures. To include further metadata, annotator and corpus dimension properties link to DataID [2] descriptions of the individual components.

GERBIL uses the recently proposed DataID [2] ontology, which combines VoID [1] and DCAT [21] metadata with PROV-O [20] provenance information and ODRL [23] licenses to describe datasets. Besides metadata properties like titles, descriptions and authors, the source files of the open datasets themselves are linked as dcat:Distributions, allowing direct access to the evaluation corpora.

Dataset                Format           Experiment
ACE2004                MSNBC            Sa2KB
Wiki                   ?                Sa2W
AQUAINT                ?                Sa2KB
MSNBC                  MSNBC            Sa2KB
IITB                   XML              Sa2KB
Meij                   TREC             Rc2W
AIDA/CoNLL             CoNLL            Sa2KB
N3 collection          NIF/RDF          Sa2KB
KORE 50                NIF/RDF          Sa2KB
Wiki-Disamb30          tab-separated    Sa2KB
Wiki-Annot30           tab-separated    Sa2KB
Spotlight Corpus       NIF/RDF          Sa2KB
SemEval-2013 task 12   XML/?            WSD/Sa2KB
SemEval-2007 task 7    XML/?            WSD
SemEval-2007 task 17   XML/?            WSD
Senseval-3             XML/?            WSD
Microposts2014         Microposts2014   Sa2KB

Table 2: Datasets and their formats. A ? indicates various inline or keyfile annotation formats. The experiments follow their definition in Section 3.2.

Furthermore, ODRL license specifications in RDF are linked via dc:license, potentially facilitating automatically adjusted processing of licensed data by NLP tools. Licenses are further specified via dc:rights, including citations of the relevant publications.

To describe annotators in a similar fashion, we extended DataID for services. The class Service, to be described with the same basic properties as datasets, was introduced. To link an instance of a Service to its distribution, i.e., the specific URI the service can be queried at, the datid:distribution property was introduced as a super property of dcat:distribution. Furthermore, Services can have a number of datid:Parameters and datid:Configurations. Datasets can be linked via datid:input or datid:output.

Offering such detailed and structured experimental results opens new research avenues in terms of tool and dataset diagnostics to increase decision makers' ability to choose the right settings for the right use case. Next to individual configurable experiments, GERBIL offers an overview of recent experiment results belonging to the same experiment and matching type in the form of a table as well as sophisticated visualizations,30 see Figure 2.


Annotator          Dataset  F1-micro
DBpedia Spotlight  IITB     0.450
Babelfy            IITB     0.182
NERD-ML            IITB     0.489
WAT                IITB     0.202
DBpedia Spotlight  KORE50   0.265
Babelfy            KORE50   0.530
NERD-ML            KORE50   0.238
WAT                KORE50   0.523

Table 5: Results of an example experiment. It is accessible at http://gerbil.aksw.org/gerbil/experiment?id=201503050003

This allows for a quick comparison of tools and datasets on recently run experiments without additional computational effort.

Figure 2: Example spider diagram of recent A2KB experiments with strong annotation matching, derived from our online interface.

4. EVALUATION
To ensure the practicability and convenience of the GERBIL framework, we investigated the effort needed to use GERBIL for the evaluation of novel annotators. To achieve this goal, we surveyed the workload necessary to implement a novel annotator in GERBIL compared to the implementation in previous, diverse frameworks.

Our survey comprised five developers with expert-level programming skills in Java. Each developer was asked to evaluate how much time he/she needed to write the code necessary to evaluate his/her framework on a new dataset.

Overall, the developers reported that they needed between 1 and 4 hours to achieve this goal (4x 1-2h, 1x 3-4h), see Figure 3. Importantly, all developers reported that they needed either the same or even less time to integrate their annotator into GERBIL.

30 http://gerbil.aksw.org/gerbil/overview

Figure 3: Comparison of effort needed to implement an adapter for an annotation system with and without GERBIL.

This result in itself is of high practical significance as it means that by using GERBIL, developers can evaluate on (currently) 11 datasets using the same effort they needed for 1, which is a gain of more than 1100%. Moreover, all developers reported that they felt comfortable implementing the annotator in GERBIL (4 points on average on a 5-point Likert scale between very uncomfortable (1) and very comfortable (5)). Further developers were invited to complete the survey, which is available at our project website. Even though small, this evaluation suggests that implementing against GERBIL does not lead to any overhead. On the contrary, GERBIL significantly improves the time-to-evaluation by offering means to benchmark and compare against other annotators and datasets within the same effort frame previously required to evaluate on a single dataset.

An interesting side-effect of having all these frameworks and datasets in a central framework is that we can now benchmark the different frameworks with respect to their runtimes within exactly the same experimental settings. These results are of practical concern for end users of annotation frameworks as they are most commonly interested in both the runtime and the quality of solutions. For example, we evaluated the runtimes of the different approaches in GERBIL for the A2KB experiment type on the MSNBC dataset. The results of this experiment are shown in Figure 4.

5. CONCLUSION AND FUTURE WORK
In this paper, we presented and evaluated GERBIL, a platform for the evaluation of annotation frameworks. With GERBIL, we aim to push annotation system developers towards better quality and wider use of their frameworks. Some of the main contributions of GERBIL include the provision of persistent URLs for reproducibility and archiving. Furthermore, we implemented a generic adaptor for external datasets as well as a generic interface to integrate remote annotator systems. The set of datasets available for evaluation in previous benchmarking platforms for annotation was extended by 6 new datasets. Moreover, 6 novel annotators were added to the platform. The evaluation of our framework by contributors suggests that adding an annotator to GERBIL demands 1 to 2 hours of work.


Figure 4: Runtime per document of different approaches in GERBIL for the A2KB experiment type on the MSNBC dataset.

Hence, while keeping the implementation effort previously required to evaluate on a single dataset, we allow developers to evaluate on (currently) 11 times more datasets. The presented web-based frontend allows for several use cases, enabling laymen and expert users to perform informed comparisons of semantic annotation tools. The persistent URIs enhance long-term citability in the field of information extraction. GERBIL is not just a new framework wrapping existing technology. In comparison to earlier frameworks, it extends the state-of-the-art benchmarks by the capability of considering the influence of NIL attributes and the ability to deal with datasets and annotators that link to different knowledge bases. More information about GERBIL and its source code can be found at the project's website.

While developing GERBIL, we spotted several flaws in the formal model underlying previous benchmarking frameworks which we aim to tackle in the future. For example, the formal specification underlying current benchmarking frameworks for annotation does not allow using the scores assigned by the annotators to their results. To address this problem, we aim to develop and implement novel measures in GERBIL that make use of scores (e.g., Mean Reciprocal Rank). Moreover, partial results are not considered within the evaluation. For example, during the disambiguation task, named entities without Wikipedia URIs are not considered. This has a significant impact on the number of true and false positives and thus on the performance of some tools. Additionally, we will consider implementing a review function with which users can submit proposals for dataset changes to be evaluated by the GERBIL team. Furthermore, certain tasks seem to be too coarse. For example, we will consider splitting the Sa2KB and the A2KB tasks into two subtasks: the first subtask would measure how well tools perform at finding named entities inside the text (NER task) while the second would evaluate how well tools disambiguate those named entities which have been found correctly (similar to the D2KB task). In the future, we also plan to provide information about the point in time since when an annotator has been stable, i.e., since when the algorithm underlying the webservice has not changed.
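As a pointer to what a score-aware measure could look like, the sketch below computes Mean Reciprocal Rank over a list of ranks of the correct entity in each scored candidate list. It merely illustrates the standard definition and is not a measure GERBIL currently implements; the class and method names are hypothetical.

```java
final class MeanReciprocalRank {
    // ranks[i] is the 1-based rank of the correct entity in the i-th scored
    // candidate list; a value of 0 means the correct entity was not returned.
    static double mrr(int[] ranks) {
        if (ranks.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int rank : ranks) {
            sum += rank > 0 ? 1.0 / rank : 0.0;
        }
        return sum / ranks.length;
    }

    public static void main(String[] args) {
        // Correct entity ranked 1st, 3rd and missing: MRR = (1 + 1/3 + 0) / 3 ≈ 0.44
        System.out.println(mrr(new int[] {1, 3, 0}));
    }
}
```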

Acknowledgments. Parts of this work were supported by the ESF and the Free State of Saxony, the FP7 projects GeoKnow (GA No. 318159), LIDER (GA No. 610782), Apps4EU (GA No. 325090), LinkedTV (GA No. 287911), ERC Starting Grant MultiJEDI No. 259234, the Italian PRIN project ARS-Technomedia, as well as by the German Government, Federal Ministry of Education and Research under the project number 03WKCJ4D.

6. ADDITIONAL AUTHORS
Ciro Baron (Leipzig University, Germany)
Andreas Both (R&D, Unister GmbH, Germany)
Martin Brümmer (Leipzig University, Germany)
Diego Ceccarelli (ISTI-CNR, Pisa, Italy)
Marco Cornolti (University of Pisa, Italy)
Didier Cherix (R&D, Unister GmbH, Germany)
Bernd Eickmann (R&D, Unister GmbH, Germany)
Paolo Ferragina (University of Pisa, Italy)
Christiane Lemke (R&D, Unister GmbH, Germany)
Andrea Moro (Sapienza University of Rome, Italy)
Roberto Navigli (Sapienza University of Rome, Italy)
Francesco Piccinno (University of Pisa, Italy)
Giuseppe Rizzo (EURECOM, France)
Harald Sack (HPI Potsdam, Germany)
René Speck (Institute for Applied Informatics, Germany)
Raphaël Troncy (EURECOM, France)
Jörg Waitelonis (HPI Potsdam, Germany)
Lars Wesemann (R&D, Unister GmbH, Germany)

7. REFERENCES
[1] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing linked datasets with the VoID vocabulary, 2011. http://www.w3.org/TR/void/.
[2] M. Brümmer, C. Baron, I. Ermilov, M. Freudenberg, D. Kontokostas, and S. Hellmann. DataID: Towards semantically rich metadata for complex datasets. In 10th International Conference on Semantic Systems, 2014.
[3] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#Microposts2014) named entity extraction & linking challenge. In Proceedings of the 4th Workshop on Making Sense of Microposts (#Microposts2014), 2014.
[4] S. Capadisli, S. Auer, and A.-C. Ngonga Ngomo. Linked SDMX data. Semantic Web Journal, 2013.
[5] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu, and K. Wang. ERD 2014: Entity recognition and disambiguation challenge. SIGIR Forum, 2014.
[6] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Dexter: An open source framework for entity linking. In Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval, 2013.
[7] M. Cornolti, P. Ferragina, and M. Ciaramita. A framework for benchmarking entity-annotation systems. In 22nd World Wide Web Conference, 2013.
[8] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Conference on Empirical Methods in Natural Language Processing-CoNLL, 2007.
[9] R. Cyganiak, D. Reynolds, and J. Tennison. The RDF Data Cube Vocabulary, 2014. http://www.w3.org/TR/vocab-data-cube/.
[10] G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. Strassel, and R. M. Weischedel. The Automatic Content Extraction (ACE) program – tasks, data, and evaluation. In LREC, 2004.
[11] M. van Erp, G. Rizzo, and R. Troncy. Learning with the Web: Spotting named entities on the intersection of NERD and machine learning. In Proceedings of the Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, 2013.
[12] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[13] P. Ferragina and U. Scaiella. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software, 2012.
[14] Y. Gil. Semantic challenges in getting work done, 2014. Invited talk at ISWC.
[15] S. Hellmann, J. Lehmann, S. Auer, and M. Brümmer. Integrating NLP using Linked Data. In 12th International Semantic Web Conference, 2013.
[16] J. Hoffart, S. Seufert, D. B. Nguyen, M. Theobald, and G. Weikum. KORE: Keyphrase overlap relatedness for entity disambiguation. In Proceedings of CIKM, 2012.
[17] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Conference on Empirical Methods in Natural Language Processing, 2011.
[18] P. Jermyn, M. Dixon, and B. J. Read. Preparing clean views of data for data mining. In ERCIM Workshop on Database Research, 1999.
[19] A. Kilgarriff. Senseval: An exercise in evaluating word sense disambiguation programs. In 1st LREC, 1998.
[20] T. Lebo, S. Sahoo, D. McGuinness, K. Belhajjame, J. Cheney, D. Corsar, D. Garijo, S. Soiland-Reyes, S. Zednik, and J. Zhao. PROV-O: The PROV Ontology, 2013. http://www.w3.org/TR/prov-o/.
[21] F. Maali, J. Erickson, and P. Archer. Data Catalog Vocabulary (DCAT), 2014. http://www.w3.org/TR/vocab-dcat/.
[22] P. McNamee. Overview of the TAC 2009 knowledge base population track. 2009.
[23] M. McRoberts and V. Rodríguez-Doncel. Open Digital Rights Language (ODRL) Ontology, 2014. http://www.w3.org/ns/odrl/2/.
[24] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In 7th International Conference on Semantic Systems (I-Semantics), 2011.
[25] D. Milne and I. H. Witten. Learning to link with Wikipedia. In 17th ACM CIKM, 2008.
[26] A. Moro, F. Cecconi, and R. Navigli. Multilingual word sense disambiguation and entity linking for everybody. In Proc. of ISWC (P&D), 2014.
[27] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. TACL, 2014.
[28] R. Navigli, D. Jurgens, and D. Vannella. SemEval-2013 Task 12: Multilingual word sense disambiguation. In Proceedings of SemEval-2013, 2013.
[29] R. Navigli, K. C. Litkowski, and O. Hargraves. SemEval-2007 Task 07: Coarse-grained English all-words task. In Proc. of SemEval-2007, 2007.
[30] R. Navigli and S. P. Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 2012.
[31] R. D. Peng. Reproducible research in computational science. Science, 2011.
[32] F. Piccinno and P. Ferragina. From TagME to WAT: A new entity annotator. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, 2014.
[33] S. S. Pradhan, E. Loper, D. Dligach, and M. Palmer. SemEval-2007 Task 17: English lexical sample, SRL and all words. In Proc. of SemEval-2007.
[34] L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In ACL, 2011.
[35] G. Rizzo, M. van Erp, and R. Troncy. Benchmarking the extraction and disambiguation of named entities on the semantic web. In Proceedings of the 9th International Conference on Language Resources and Evaluation, 2014.
[36] M. Röder, R. Usbeck, S. Hellmann, D. Gerber, and A. Both. N3 – A collection of datasets for named entity recognition and disambiguation in the NLP Interchange Format. In 9th LREC, 2014.
[37] M. Rowe, M. Stankovic, and A.-S. Dadzie, editors. Proceedings, 4th Workshop on Making Sense of Microposts (#Microposts2014): Big things come in small packages, Seoul, Korea, 7th April 2014, 2014.
[38] B. Snyder and M. Palmer. The English all-words task. In Proc. of Senseval-3, pages 41–43, 2004.
[39] R. Speck and A.-C. Ngonga Ngomo. Ensemble learning for named entity recognition. In The Semantic Web – ISWC 2014, 2014.
[40] N. Steinmetz, M. Knuth, and H. Sack. Statistical analyses of named entity disambiguation benchmarks. In 1st Workshop on NLP&DBpedia 2013, 2013.
[41] N. Steinmetz and H. Sack. Semantic multimedia information retrieval based on contextual descriptions. In P. Cimiano, O. Corcho, V. Presutti, L. Hollink, and S. Rudolph, editors, The Semantic Web: Semantics and Big Data, 2013.
[42] B. M. Sundheim. TIPSTER/MUC-5: Information extraction system evaluation. In Proceedings of the 5th Conference on Message Understanding, 1993.
[43] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003.
[44] R. Usbeck, A.-C. Ngonga Ngomo, M. Röder, D. Gerber, S. Coelho, S. Auer, and A. Both. AGDISTIS – Graph-based disambiguation of named entities using Linked Data. In The Semantic Web – ISWC 2014, 2014.
