Entity-Centric Information Access with Human in the Loop ... · of these tools to biomedical data, which is subject to a recently started project on biomedical knowledge acquisition

Proceedings of the Biomedical NLP Workshop associated with RANLP 2017, pages 42–48,Varna, Bulgaria, 8 September 2017.

https://doi.org/10.26615/978-954-452-044-1_006

Entity-Centric Information Access with the Human-in-the-Loopfor the Biomedical Domains

Seid Muhie Yimam†, Steffen Remus†, Alexander Panchenko†,Andreas Holzinger‡, and Chris Biemann†

†Language Technology Group, Department of InformaticsUniversitat Hamburg, Germany

‡Research Unit HCI-KDDInstitute for Medical Informatics, Statistics and Documentation

Medical University Graz, Austria{yimam,remus,panchenko,biemann}@informatik.uni-hamburg.de

[email protected]

Abstract

In this paper, we describe the concept ofentity-centric information access for thebiomedical domain. With entity recog-nition technologies approaching accept-able levels of accuracy, we put forwarda paradigm of document browsing andsearching where the entities of the domainand their relations are explicitly mod-eled to provide users the possibility ofcollecting exhaustive information on re-lations of interest. We describe threeworking prototypes along these lines:NEW/S/LEAK, which was developed forinvestigative journalists who need a quickoverview of large leaked document col-lections; STORYFINDER, which is a per-sonalized organizer for information foundin web pages that allows adding enti-ties as well as relations, and is capa-ble of personalized information manage-ment; and adaptive annotation capabilitiesof WEBANNO, which is a general-purposelinguistic annotation tool. We will dis-cuss future steps towards the adaptationof these tools to biomedical data, whichis subject to a recently started project onbiomedical knowledge acquisition. A keydifference to other approaches is the cen-tering around the user in a Human-in-the-Loop machine learning approach, whereusers define and extend categories and en-able the system to improve via feedbackand interaction.

1 Introduction

Recently, knowledge management as a field facedseveral challenges. On one hand, sophisticatedtechnologies and standards were developed to sup-port knowledge-based modeling, such as domainontologies including Disease Ontology, MeSH,and Gene Ontology1 and the Semantic Web de-scription languages and infrastructures includingRDF, OWL, SPARQL and others2. On the otherhand, the current approaches face three major is-sues: (1) knowledge bottleneck: resources re-quired for knowledge management such as domainontologies are not available for many domains andlanguages; (2) the overall approach of knowledgemanagement did not get widely spread due to thefact that it imposes a large burden on the user,such as annotation or expertise with complex toolssuch as Protege3; (3) modeling entire domainsas large as the medical domain with (English-oriented) knowledge resources does not meet re-quirements of users, who are mostly specializingin a certain sub-field and also need to operate intheir local language.

We propose to reload this traditional heavy-weight top-down knowledge management ap-proach and replace it with a much simpler andpractical problem-oriented bottom-up approach.We choose the biomedical domain as the area ofinterest for our planning. Medical researchershave to process enormous amounts of literature –PubMed4 adds about half a million papers to itsindex each year. Literature search and reason-

1http://do-wiki.nubic.northwestern.edu;

http://ncbi.nlm.nih.gov/mesh; http://geneontology.org2http://www.w3.org/standards/semanticweb

3http://protege.stanford.edu

4http://www.ncbi.nlm.nih.gov/pubmed

42

https://doi.org/10.26615/978-954-452-044-1_006

ing is demanding, because of the need to revealand maintain many complex relationships betweennumerous sets of entities. In order to alleviatethe efforts of biomedical research related to litera-ture we propose a novel conception to informationmanagement based on bottom-up construction ofa problem-oriented ontology, called entity graph(EG) in this paper. Entity graphs provide a newtool for medical researchers that (1) help to doc-ument relations between biomedical entities in acompact intuitive and interpretable form; (2) gen-erate new relations in a semi-automatic way basedon corpus analysis; (3) communicate new biomed-ical knowledge in a form of an easily interpretableinteractive graph and (4) share knowledge and an-notations amongst researchers.

2 Related Work

An early conception of a system for personal in-formation management was Memex (Bush, 1945).The proposed design suggested that all documentsof a person should be indexed to be easily acces-sible for consultation and for sharing with otherpeople. Several decades later, the Web and socialnetworks implement this vision yet only partially.According to Davenport (1994), Knowledge Man-agement (KM) is a process of capturing, distribut-ing, and effectively using knowledge. Accordingto Gruber (1995), an ontology is an explicit spec-ification of conceptualization. Studer et al. (1998)defines ontology as a formal, explicit specificationof shared conceptualization. Multiple other infor-mal and formal definitions of ontology are pre-sented by Cimiano et al. (2014). Here “conceptu-alization” is a worldview, a system of conceptionsand their relations.

Ontologies can be either general or domain-specific. Today’s content management systems arelargely accessed with facetted search, i.e. with tax-onomically organized vocabularies forming a se-mantic facet. Users of the system must learn thevocabulary in order to assign the correct terms tonewly ingested documents and to perform effec-tive searches. The Cyc project (Lenat and Guha,1990) was an early ontology-driven attempt tomodel world knowledge. Jurisica et al. (1999) pre-sented an overview of using ontologies for infor-mation management. Later, knowledge manage-ment using ontologies was driven by the Seman-tic Web vision (Berners-Lee et al., 2001). Thiseventually led to the Linked Open Data cloud

of resources, containing a comprehensive collec-tion of interlinked ontologies. One limiting fac-tor of widespread usage of ontologies is the heavyburden of their manual construction: all con-cepts, attributes and relations in ontologies areadded and updated manually. Moreover, evenif suitable ontologies for a target domain exist,they do not come with mechanisms to recognizetheir concepts in unstructured text, motivating ap-proaches that learn ontologies from text (see Bie-mann (2005) and Buitelaar et al. (2005)).

Both EGs and ontologies aim at providing ashared explicit conceptualization of a certain do-main. However, there are several important dif-ferences between these two resources. First, EGsare task- and/or problem-specific descriptions ofa domain, while ontologies are usually designedas generic knowledge representations for a givendomain. Ontologies are commonly developed asgeneral-purpose resources that are supposed tomodel a certain domain without taking into ac-count specific needs of certain application. Thisleads in practice to the fact that most resourcesshould be specifically tailored to fit the need ofthe given task, problem or application. Alongthese lines, Hirst (2014) notes that the worldviewcaptured in ontologies is based on the author ofthe ontology, not on the user, and the knowledgeis not contextualized. We argue that this is oneof the key reasons of only moderate success ofontology-based knowledge management after 15years of development. Our approach will tacklethis shortcoming: entity graphs are a knowledgerepresentation tool that is designed to be strictlytask-oriented. Such a graph would contain onlyconcepts and relations relevant to the describedproblem at hand omitting any irrelevant details.

Mind maps (MMs) are visual diagrams thathelp to organize information about certain top-ics. Entity graphs have several common aspectswith Mind Maps and similar knowledge manage-ment structures, such as concept maps and concep-tual diagrams (Willis and Miertschin, 2006; Ep-pler, 2006), but are not confined to a tree struc-ture, hence they are more apt for sharing and bringprovenance in documents into the representation.

BEST is a biomedical entity search tool forknowledge discovery from biomedical literature(Lee et al., 2016). Although PubMed (the freepublic interface to MEDLINE, which provides ac-cess to bibliographic information in MEDLINE as

43

well as additional life science journals) providesa starting point to researchers, it only provideslists of relevant articles, leaving the task of extract-ing required information to the researchers them-selves. Existing context extraction systems havelimitations, such as 1) they provide outdated orincomplete results 2) the processing takes longer,and 3) most of them depend on conventionalsearch system structures to return relevant infor-mation. BEST is developed to face the challengesof getting relevant documents from biomedical lit-erature publications, addressing most challengesby directly returning ten relevant entities for auser’s query instead of a list of documents. Ourapproach differs from BEST in many aspects suchas 1) instead of relying on existing entity dictionar-ies, we use a semi-supervised entity recognitionsystem, 2) instead of returning a pre-computedlist of (indexed) results, our approach directs theresearcher in pinpointing the required informa-tion with directed visual exploration, i.e. a guidedsearch, 3) in addition to pre-defined entity typesor dictionaries, our approach allows researchersto define their own entity types without the needof advanced pre-processing or text mining knowl-edge, i.e. adaptive annotation.

Zhang and Elhadad (2013) propose an unsuper-vised approach for detecting biomedical entities.Instead of hand-crafted rules or annotated dataset,this work first identifies classes of entities basedon UMLS5 semantic groups in order to collectseed terms. Next, they extract chunks in order toautomatically determine named entity boundaries.Finally, they use a similarity based approach toautomatically group named entities into specificsemantic classes. While this approach is bene-ficial to identify biomedical entities, it has somedrawbacks compared to our approach: 1) their ap-proach depends on the collection of seed terms,2) it assumes that every biomedical document isavailable at all times.

3 Three Technologies for Entity-CentricInformation Access

While we target the biomedical domain, we willdescribe our previous work on other domains. Theentity types might change, but the principles of theentity graph is transferable across domains.

5Unified Medical Language System (UMLS) is a widelyused ontology of biomedical terms available at https://www.nlm.nih.gov/research/umls/.

3.1 Adaptively Annotating Entities withWEBANNO

Supervised named entity recognition (NER) sys-tems require a substantial amount of annotateddata to achieve high quality performance. Wepresent an interactive and adaptive annotation ap-proach. Instead of using a large sets of generalpurpose annotation corpora, we focus on specifi-cally collecting high quality sets of in-domain an-notations. In a case study for adaptive biomedicalentity annotation, we used the automation compo-nent of WEBANNO, which is a web-based anno-tation tool with an online machine learning com-ponent (Yimam et al., 2014). Annotations are cre-ated in an interactive and incremental approach.The process is interactive in such a way that thetool suggests annotations that can be accepted, re-jected or corrected by the annotator, whereby ma-chine learning model gets better in time.

3.2 Case Study: Entity Annotation

We conducted an annotation task for identify-ing medical entities using WEBANNO automa-tion, which is focused on B-Chronic lymphocyticleukemia (B-CLL). A medical expert selects do-main related abstracts for annotation. Unlike pre-vious approaches, the expert starts annotating textswithout prior determination of the entity types.During the annotation process, important entitiesare identified that could help retrieving relevantdocuments about B-CLL. In a first step, we an-notated five abstracts and use them for training toproduce suggestions.

The following entity types are identifiedthroughout the task: CELL, CONDITION,DISORDER, GENE, MOLECULE, PROTEIN,MOLECULAR PATHWAY and SUBSTANCE.We can see the following advantages of the adap-tive annotation approach: 1) it makes the anno-tation task faster by producing correct predictionsafter annotating only a few number of documents,2) the process helps the annotator to determineentity types unlike traditional approaches wherethe types are predefined by experts beforehand.This makes the identification of entity types morecomplete and robust (see details in Yimam et al.,2016a).

One of the typical relations between biomedicalentities describe the cause and effect of diseases.Again, supervised machine learning approachesfor automatic relation extraction requires more ef-

44

Figure 1: NEW/S/LEAK UI overview of the GENIA term annotation corpus. The example shows aB-CLL query and the graph shows involved DNA regions, “c-myc gene” is selected.

fort. For rapid annotation of relations, the relationcopy annotator in WEBANNO was used, where re-lation suggestions are provided as soon as annota-tors create the first relation annotations. This func-tionality has the following advantages: a) expertscan annotate entities as well as relation annota-tions at the same time, b) instances of the sameentity and relation are automatically suggested forthe running document as well as other unfinisheddocuments.

3.3 Collection Insights with NEW/S/LEAK

NEW/S/LEAK is a tool designed to support inves-tigative data journalism by exploring large sets ofinput documents, typically leaked documents (Yi-mam et al., 2016b). Named entities, such as per-sons, organizations, and locations, are automat-ically identified and ranked by importance. Aglobal graph of entities is constructed, which issubsequently used to display high-level interac-tions among those entities. The tool is intendedto guide investigative data journalists, by offeringa rich set of possible interactions, among whichare: full text search, entity merger or removal,document aggregation using meta-data, and manymore.

Journalists, as targeted user group, can browsethe document collection using the interactive inter-face (see Figure 1). It enables faceted documentexploration within several views: 1) the graphview shows named entities and their relations, 2)the document timeline view shows document fre-quency in different epochs, 3) the document viewis composed of the document list and a documenttext for reading, and 4) the metadata views in-

clude the search- and history views, which offerdifferent metadata for filtering relevant or irrele-vant documents.

The views are interactive, i.e. users can browseand explore the document collection on demand.The user starts with exploring entities and theirconnections in the graph view or by searchingfor entities and keywords. All interactions in theviews define a filter that constrains the current doc-ument set, which in turn changes the displayed in-formation content. User-selected entities are high-lighted in the documents.

Graph view: entities and their co-occurrencesThe graph view shows a set of entities as nodesand their connections as links. The node size de-notes the frequency of an entity, the node colordenotes the entity type. The number of shown en-tities can be set by the user individually for eachfacet (entity type). The edge thickness and labeldenotes the size and relation of co-occurrence ofthe involved entities within the documents.

Document timeline The document timeline liststhe number of documents in a specific epoch.Users can refine their search to see the documentdistribution over years, months or days.

Document view The document view shows a listof documents with their heading as selected by thecurrently active filters. For large document collec-tions, the documents are loaded on demand. Thedocument text view shows the text of the docu-ment, where the entities displayed in the graphare highlighted and underlined. The underlinecolor corresponds to the type of entity. Selectedentities in the graph are highlighted, which en-

45

ables a “close reading” mode to verify hypothesesformed in the so-called “distant reading” visual-ization (Moretti, 2007).

Metadata, search and history tracing This viewis mainly used to filter documents based on dif-ferent criteria such as metadata, entities, searchterms/key words, etc. The history tracer helps thejournalist to modify the search facets.

3.4 Personalized Knowledge Managementwith STORYFINDER

STORYFINDER is a toolkit that aims to keep in-formation managed which is found and processedwhile browsing the web (Remus et al., 2017). Themajor goal is to organize a personal history of bitsof information in form of entities and their rela-tions rather than a history of web pages while stillbeing able to find the source of a particular infor-mation bit in the respective web pages.

The system consists of three major components(cf. Fig. 2): 1) the Mozilla Firefox browser plu-gin, which: listens and reacts to a user’s actions;initiates the analysis of a currently visited web-page on the backend server; and provides a sidepane view to visualize the collected information;2) the server backend, which: performs the anal-ysis of a webpage; extracts metadata and stores theinformation for later access; and 3) the interactiveweb page, which: provides real-time access to thenew information and is embedded in the plugin’sside pane and can be accessed as a regular webpage too.

In its current form, STORYFINDER is targetedfor processing news texts; it automatically extractsnamed entities and draws an edge in a knowl-edge graph representation if two distinct entitiesco-occur in the same sentence (Fig. 3a).

The entities are subsequently highlighted withinthe current article for better visual appearance(Fig. 3b). The graph, i.e. the entities as nodes andtheir relations as edges are fully editable (Fig. 3c).

Due to the modular REST architecture regard-ing the NLP components within the backendserver, every automatic component is exchange-able, e.g. in order to automatically identify med-ical entities such as proteins, we merely need a re-liable protein tagger. In order to build such a tag-ger, annotated data is needed, which calls for anintegration with adaptive annotation(Section 3.1).

StoryfinderWebserver

SFWebserverSFWebserverNLPComponents

StoryfinderBrowserPlugin

Database

websocketsRESTservices

Figure 2: Schema of STORYFINDER’s compo-nents: The browser plugin, the server backend,and the interactive web page.

(a) The entity ‘Philipp Lahm’ is selected, other nodes andedges are grayed out except direct neighboring edges andnodes. Additionally an edge is hovered (rightmost thickedge).

(b) Screenshot of the default STORYFINDER plugin view. Acurrently visited webpage is analyzed, and the extracted en-tities are highlighted in an overlay. Entities are rendered in agraph together with their relations in the STORYFINDER web-page, which is shown in a side pane of the browser.

(c) Manually adding entities of arbitrary kind can be accom-plished via the plugin by right clicking any term or phrase.

Figure 3: Selected STORYFINDER screenshots.

4 Towards Information Managementwith Human-in-the-Loop

Within our newly started project, we will imple-ment a prototype that uses the entity graph repre-sentation as the primary means for visualizing and

46

Figure 4: An entity graph summarizing the literature research on B-CLL. The key symptoms, drugs andtreatments around the B-CLL are shown with their labeled relations. From labels, it becomes clear howentities relate to the topic, click on edges retrieves documents where connected concepts co-occur.

accessing biomedical research documents, inte-grating elements from prototypes described above.Key to the approach is to think the user in thecenter of the process and offer the user an adap-tive ML environment (Holzinger, 2016; Holzingeret al., 2017) where manual effort in terms of an-notating entities or classifying relations immedi-ately pays of in an improved representation in theEG. To exemplify how this could look like, Fig-ure 4 shows an example from leukemia research.Entities and their relations have been annotatedand semi-automatically recognized in a personalcollection of MEDLINE papers (Yimam et al.,2016a). Interacting with the network allows to findrespective documents.

With these actions, the biomedical researchercan utilize the entity graph as a visually support-ive notepad. Note that this goes well beyond atraditional notepad since collections of propertiesof entities usually do not get linked, and this alsogoes well beyond creativity tools such as e.g. mindmaps, since it does not only displays concepts, butfacilitates linking to source documents. Note fur-ther that while automatic methods aid the process,the biomedical researcher is in full control of theentity graph and can correct errors in the automaticprocessing in case they are relevant for the ques-tion of investigation.

Last, but not least, the individual entity graphscan be merged into a global structure by sharingamong researchers. Thus in our approach, theconceptualization of a domain will be modeledfrom BOTTOM-UP, and not from TOP-DOWN as inthe traditional knowledge management approach.Therefore, collaborative efforts of the crowd will

lead to construction of a global entity graph ofa domain in an incremental and problem-drivenway. The global graph can be used to softly sug-gest edge annotations while a user constructs anew graph, making the overall process of entitygraph construction backed up by a huge globalentity graph, which has provenance information(i.e. who has entered information, based on whichdocument) for mutual understanding. The globalgraph can also incorporate information from re-sources, such as MeSH, Gene and Disease Ontol-ogy. Challenges in the adaptation include a high-quality tagging of biomedical entities, preprocess-ing such as dependency parsing for relevant lan-guages, the design of the user interface and a re-sponsive online-adaptive machine learning model.

5 Conclusion

We proposed a new schema for entity-centric in-formation extraction and -access for biomedicalentities. We highlighted current drawbacks andnew challenges, and presented existing tools forinformation extraction (WEBANNO), visualiza-tion and navigation (NEW/S/LEAK), and person-alized information and knowledge management(STORYFINDER), which all together can be com-bined, adapted, and re-focused in order to pro-vide a data driven, bottom-up, conceptualizationapproach. Here, the Human-in-the-Loop is an in-tegral component, where not only the machinelearning models for information extraction aresupported and improved by users over time, thefinal entity graph becomes larger, cleaner, moreprecise and thus more usable for the users.

47

Acknowledgments

This research was supported by the Federal Min-istry for Education and Research (Germany) un-der grant no. 01DS17033 and by the VolkswagenFoundation under grant no. 90 847.

ReferencesTim Berners-Lee, James Hendler, and Ora Lassila.

2001. The Semantic Web. Scientific American284(5):28–37.

Chris Biemann. 2005. Ontology learning from text - asurvey of methods. LDV Forum (2):75–93.

Paul Buitelaar, Philipp Cimiano, and BernardoMagnini. 2005. Ontology learning from text : meth-ods, evaluation and applications, volume 123. IOSPress.

Vannevar Bush. 1945. As we may think. The AtlanticMonthly 176(1):101–108.

Philipp Cimiano, Christina Unger, and John McCrae.2014. Ontology-based interpretation of natural lan-guage. Synthesis Lectures on Human LanguageTechnologies. 7.2:1–178.

Thomas H. Davenport. 1994. Saving it’s soul: Human-centered information management. Harvard Busi-ness Review 72(2):119–131.

Martin J. Eppler. 2006. A comparison between conceptmaps, mind maps, conceptual diagrams, and visualmetaphors as complementary tools for knowledgeconstruction and sharing. Information Visualization5:202–210.

Thomas R. Gruber. 1995. Toward principles for the de-sign of ontologies used for knowledge sharing. Int.J. Hum.-Comput. Stud. 43(5-6):907–928.

Graeme Hirst. 2014. Overcoming Linguistic Barriersto the Multilingual Semantic Web, Berlin, Heidel-berg, pages 3–14.

Andreas Holzinger. 2016. Interactive machine learningfor health informatics: when do we need the human-in-the-loop? Brain Informatics 3(2):119–131.

Andreas Holzinger, Markus Plass, KatharinaHolzinger, Gloria Cerasela Crisan, Camelia-M.Pintea, and Vasile Palade. 2017. A glass-boxinteractive machine learning approach for solvingNP-hard problems with the human-in-the-loop.ArXiv e-prints .

Igor Jurisica, John Mylopoulos, and Eric Yu. 1999. Us-ing ontologies for knowledge management: An in-formation systems perspective. In Proceedings ofthe ASIS Annual Meeting. Washington, DC, USA,pages 482–496.

Sunwon Lee, Donghyeon Kim, Kyubum Lee, JaehoonChoi, Seongsoon Kim, Minji Jeon, Sangrak Lim,Donghee Choi, Sunkyu Kim, Aik-Choon Tan, andJaewoo Kang. 2016. Best: Next-generation biomed-ical entity search tool for knowledge discovery frombiomedical literature. PLOS ONE 11(10):1–16.

Douglas Lenat and Ramanathan V. Guha. 1990. Build-ing Large Knowledge Bases. Addison-Wesley Pub.Co, Reading, MA.

Franco Moretti. 2007. Graphs, maps, trees : abstractmodels for a literary history. Verso, London, UK.

Steffen Remus, Manuel Kaufmann, Kathrin Ballweg,Tatiana von Landesberger, and Chris Biemann.2017. Storyfinder: Personalized knowledge baseconstruction and management by browsing the web.In Proceedings of the 26th ACM International Con-ference on Information and Knowledge Manage-ment. Singapore, Singapore. To appear.

Rudi Studer, V. Richard Benjamins, and Dieter Fensel.1998. Knowledge engineering: Principles andmethods. Data Knowl. Eng. 25(1-2):161–197.

Cheryl L. Willis and Susan L. Miertschin. 2006. Mindmaps as active learning tools. Journal of computingsciences in colleges 21(4):266–272.

Seid Muhie Yimam, Chris Biemann, Ljiljana Majnaric,Sefket Sabanovic, and Andreas Holzinger. 2016a.An adaptive annotation approach for biomedical en-tity and relation recognition. Brain Informatics3(3):157–168.

Seid Muhie Yimam, Richard Eckart de Castilho, IrynaGurevych, and Chris Biemann. 2014. Automatic an-notation suggestions and custom annotation layers inWebAnno. In Proc. of ACL 2014: System Demon-strations. Baltimore, MD, USA, pages 91–96.

Seid Muhie Yimam, Heiner Ulrich, Tatiana von Lan-desberger, Marcel Rosenbach, Michaela Regneri,Alexander Panchenko, Franziska Lehmann, UliFahrer, Chris Biemann, and Kathrin Ballweg. 2016b.new/s/leak – information extraction and visualiza-tion for investigative data journalists. In Proceed-ings of ACL-2016 System Demonstrations. Associ-ation for Computational Linguistics, Berlin, Ger-many, pages 163–168.

Shaodian Zhang and Noemie Elhadad. 2013. Unsu-pervised biomedical named entity recognition. J. ofBiomedical Informatics 46(6):1088–1098.

48

Documents

Entity-Centric Information Access with Human in the Loop ... · of these tools to biomedical data, which is subject to a recently started project on biomedical knowledge acquisition