
1939-1374 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSC.2017.2711600, IEEE Transactions on Services Computing

IEEE TRANSACTIONS ON SERVICES COMPUTING 1

Building and Querying an Enterprise Knowledge Graph

Dezhao Song, Frank Schilder, Shai Hertz, Giuseppe Saltini, Charese Smiley, Phani Nivarthi, Oren Hazai, Dudi Landau, Mike Zaharkin, Tom Zielund, Hugo Molina-Salgado, Chris Brew and Dan Bennett

Abstract—Given data reaching an unprecedented amount, coming from diverse sources, and covering a variety of domains in heterogeneous formats, information providers are faced with the critical challenge to process, retrieve and present information to their users in order to satisfy their complex information needs. In this paper, we present Thomson Reuters’ effort in developing a family of services for building and querying an enterprise knowledge graph in order to address this challenge. We first acquire data from various sources via different approaches. Furthermore, we mine useful information from the data by adopting a variety of techniques, including Named Entity Recognition and Relation Extraction; such mined information is further integrated with existing structured data (e.g., via Entity Linking techniques) in order to obtain relatively comprehensive descriptions of the entities. By modeling the data as an RDF graph model, we enable easy data management and the embedding of rich semantics in our data. Finally, in order to facilitate the querying of this mined and integrated data, i.e., the knowledge graph, we propose TR Discover, a natural language interface that allows users to ask questions of our knowledge graph in their own words; these natural language questions are translated into executable queries for answer retrieval. We evaluate our services, i.e., named entity recognition, relation extraction, entity linking and natural language interface, on real-world datasets, and demonstrate and discuss their practicability and limitations.

Index Terms—Knowledge Graph, Data Acquisition, Data Transformation, Data Modeling, Data Interlinking, Natural Language Interface


1 INTRODUCTION

Knowledge workers, such as scientists, lawyers, traders or accountants, have to deal with an ever-greater amount of data with an increased level of variety. Their information needs are often focused on entities and their relations, rather than on documents. To satisfy these needs, information providers must pull information from wherever it happens to be stored and bring it together in a summary result. As a concrete example, suppose a user is interested in companies with the highest operating profit in 2015 currently involved in Intellectual Property (IP) lawsuits. In order to answer this query, one needs to extract company entities from free text documents, such as financial reports and court documents, and then integrate the information extracted from different documents about the same company.

There are three main challenges for providing information to knowledge workers so that they can receive the answers they need:

1) How to process and mine useful information from large amounts of unstructured and structured data

2) How to integrate such mined information for the same entity across disconnected data sources and store it in a manner that allows easy and efficient access

• Dan Bennett, Shai Hertz, Giuseppe Saltini, Phani Nivarthi, Oren Hazai, Dudi Landau, and Mike Zaharkin are with the Central Technology Platform Group of Thomson Reuters.

• Dezhao Song, Frank Schilder, Charese Smiley, Tom Zielund, Hugo Molina-Salgado and Chris Brew are with the Research and Development Group of Thomson Reuters.

• Correspondence to Dan Bennett, Frank Schilder and Dezhao Song. E-mail: {dan.bennett, frank.schilder, dezhao.song}@tr.com

Manuscript received March 31, 2016

3) How to quickly find the entities that satisfy the information needs of today’s knowledge workers.

A knowledge graph is a general concept of representing entities and their relationships, and there have been various efforts underway to create knowledge graphs that connect entities with each other. For instance, the Google Knowledge Graph consists of around 570 million entities as of 2014 [1]. In this paper, we describe Thomson Reuters’ approach to addressing the three challenges introduced above. Within Thomson Reuters, data may be produced manually, e.g., by journalists, financial analysts and attorneys, or automatically, e.g., from financial markets and cell phones. Furthermore, the data we have covers a variety of domains, such as media, geography, finance, legal, academia and entertainment. In terms of the format, data may be structured (e.g., database records) or unstructured (e.g., news articles, court dockets and financial reports).

Given this large amount of data available, from diverse sources and about various domains, the biggest challenge is how to structure this data in order to best support users’ information needs. First of all, we need to be able to ingest and consume the data in a scalable manner. This data ingestion process needs to be robust enough to be capable of processing all types of data (e.g., relational databases, tabular files, free text documents and PDF files) that may be acquired from various data sources.

Furthermore, although much of the data is already in structured formats (e.g., database records and statements represented using the Resource Description Framework1 (RDF)), a significant amount of the data is still free text.

1. https://www.w3.org/TR/rdf11-primer/


Such unstructured data may include patent filings, financial reports, academic publications, etc. In order to best satisfy users’ information needs, it is critical to add structure to these free text documents. Additionally, we cannot leave this data sitting in separated “silos”; it is important to integrate the data in order to facilitate downstream applications, such as search and data analytics.

Data modeling and storage is another important part of our knowledge graph pipeline. A data modeling mechanism should be flexible enough to allow scalable data storage, easy data updates and schema flexibility. The Entity-Relationship (ER) modeling approach, for example, is a mature technique; however, we find that it is difficult to rapidly accommodate new facts in this model. Inverted indices allow efficient retrieval of the data; however, their biggest drawback is that they only support keyword queries, which may not be sufficient to satisfy complex information needs. RDF is a flexible model for representing data in the format of three-element tuples (triples) with no fixed schema requirement. An RDF model also allows for more expressive semantics of the modeled data that can be used for knowledge inference.

Finally, the ingested, transformed, integrated and stored data will only become useful if answers can be efficiently retrieved by our users in an intuitive manner. Currently, the mainstream approaches to searching for information are keyword queries and specialized query languages (e.g., SQL and SPARQL2). The former are not able to represent the exact query intent of the user, in particular for questions involving relations or other restrictions such as temporal constraints (e.g., IBM lawsuits since 2014); the latter require users to become experts in specialized, complicated, and hard-to-write query languages. Thus, both mainstream techniques create severe barriers between data and users, and do not serve well the goal of helping users effectively find the information they are seeking in today’s hypercompetitive, complex, and Big Data world.

Based upon the discussion above, in this paper, we present our effort in building and querying an enterprise knowledge graph3, with the following major contributions:

• We first present our data acquisition process from various sources. The acquired data is stored in a raw data store, which may include relational databases, Comma Separated Value (CSV) files, and so on.

• Next, we apply our Named Entity Recognition (NER), relation extraction and entity linking techniques in order to mine valuable information from the acquired data. Such mined and integrated data then constitutes our knowledge graph.

• Furthermore, we propose TR Discover, a natural language interface that enables users to intuitively search for information from our knowledge graph using their own words.

• Finally, we evaluate our NER, relation extraction and entity linking techniques on a real-world news corpus and show that our techniques are able to achieve competitive performance. We also evaluate TR Discover on

2. https://www.w3.org/TR/sparql11-overview/
3. See our technical report for more details: https://goo.gl/P0xgRr

a graph of 2.2 billion triples by using 10K randomly generated questions of different levels of complexity.

We organize the rest of the paper as follows. We first present an overview of our service framework in Section 2. Next, we present our data acquisition, transformation and interlinking (i.e., NER, relation extraction and entity linking) processes in Section 3. We then describe how we model and store such processed data in Section 4. Section 5 presents our natural language interface for querying the knowledge graph. In Section 6, we evaluate the various components of our system. We discuss related work in Section 7, describe lessons learned in Section 8, and conclude in Section 9.

2 SERVICE FRAMEWORK OVERVIEW

Figure 1 demonstrates the overall architecture of our system. In this diagram, the solid lines represent our batch data processing, whose results are used to update our knowledge graph; the dotted lines represent the interactions between users and our various services. For services that are publicly available, we have published a user guide and code examples in different programming languages4.

Fig. 1. System Architecture

First of all, during our data acquisition and ingestion processes (Section 3.1), we consume data from various sources, including live data feeds, web pages and other non-textual data (e.g., PDF files). For example, for PDF files, we apply commercial Optical Character Recognition (OCR) software to obtain the text from them. We also analyze web pages and extract their textual information.

Next, given a document in the raw data, a single POST request is issued to our core service for entity recognition and relation extraction. Furthermore, our service performs disambiguation within the recognized entities. For example, if two recognized entities “Tim Cook” and “Timothy Cook” have been determined by our system to both refer to the CEO of Apple Inc., they will be grouped together as one recognized entity in the output. Finally, our system will try to link each of the recognized entities to our existing knowledge graph. If a mapping between a recognized entity

4. https://permid.org/


and one in the knowledge graph is found, in the output of the core service, the recognized entity will be assigned the existing entity ID in our knowledge graph.
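
The exact shape of the core service's output is not shown here; the JSON body below is a hypothetical illustration of how mentions resolved to the same entity ("Tim Cook" / "Timothy Cook") come back grouped under one ID:

```python
# Hypothetical tagging-service output for one document, and the
# grouping of mentions by resolved entity ID that it implies.
import json
from collections import defaultdict

response_body = json.dumps([
    {"mention": "Tim Cook", "entity_id": "P-100", "type": "Person"},
    {"mention": "Timothy Cook", "entity_id": "P-100", "type": "Person"},
    {"mention": "Apple Inc.", "entity_id": "C-200", "type": "Company"},
])

# Group mentions by the knowledge-graph entity they were resolved to.
entities = defaultdict(list)
for tag in json.loads(response_body):
    entities[tag["entity_id"]].append(tag["mention"])

print(dict(entities))
```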

The entity linking service can also be called separately. It takes a CSV file as input where each line is a single entity that will be linked to our knowledge graph. In our current deployment, each CSV file can contain up to 5,000 entities.
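
A caller therefore has to split larger inputs into multiple files. A minimal sketch, assuming a hypothetical two-column CSV layout (the real service's column schema is not specified here):

```python
# Split a list of entities into CSV payloads of at most 5,000 rows each,
# matching the per-file limit of the entity linking service.
import csv
import io

MAX_ROWS = 5000

def to_csv_batches(entities, max_rows=MAX_ROWS):
    """Yield CSV payload strings, each holding at most max_rows entities."""
    for start in range(0, len(entities), max_rows):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["name", "country"])  # hypothetical columns
        writer.writerows(entities[start:start + max_rows])
        yield buf.getvalue()

batches = list(to_csv_batches([("Denso Corp", "JP")] * 7000))
print(len(batches))  # 7000 entities -> two files: 5000 rows + 2000 rows
```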

While performing the above discussed services, with our RDF model (Section 4), we store our knowledge graph, i.e., the recognized entities and their relations, in an inverted index for efficient retrieval with keyword queries (i.e., the Keyword Search Service in Figure 1) and also in a triple store in order to support complex query needs.

Finally, in order to support our natural language interface, TR Discover (Section 5), we have developed internal processes to retrieve the entities and relations from the knowledge graph in order to build the necessary resources for the relevant sub-modules (e.g., a lexicon for question understanding). Users can then ask a natural language question through a Web interface.
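
As a toy illustration of the general idea of lexicon-driven question translation (this is not TR Discover's actual grammar or pipeline; the pattern, lexicon and ontology terms below are invented), one question shape can be mapped to a query template:

```python
# Toy natural-language-to-SPARQL translation: one regex pattern plus a
# lexicon mapping question words to (hypothetical) ontology classes.
import re

lexicon = {"lawsuits": "ex:Lawsuit", "companies": "ex:Company"}

def translate(question):
    """Map questions like 'Lawsuits involving IBM since 2014' to SPARQL."""
    m = re.match(r"(\w+) involving (\w+) since (\d{4})", question.lower())
    if not m:
        return None  # question shape not covered by this toy grammar
    etype, name, year = m.groups()
    return (
        f"SELECT ?x WHERE {{ ?x a {lexicon[etype]} ; "
        f"ex:party ex:{name} ; ex:filedYear ?y . FILTER(?y >= {year}) }}"
    )

print(translate("Lawsuits involving IBM since 2014"))
```

A real system needs a proper grammar and disambiguation rather than one regex, but the lexicon-plus-template shape is the same.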

3 DATA ACQUISITION, TRANSFORMATION AND INTERLINKING

In this section, we present an overview of our data, including its acquisition and curation process. Such collected and curated data is then used to build our knowledge graph.

3.1 Data Source and Acquisition

In general, our data covers a variety of industries, including Financial & Risk (F&R), Tax & Accounting, Legal, and News. Each of these four major data categories can be further divided into various sub-categories. For instance, our F&R data ranges from Company Fundamentals to Deals and Mergers & Acquisitions. Our professional customers rely on such rich datasets in order to find trusted answers. Table 1 provides a high-level summary of our data space.

TABLE 1
An Overview of Thomson Reuters Data Space

Industry: Financial & Risk (F&R)
Description: F&R data primarily consists of structured data such as intra- and end-of-day time series, Credit Ratings, and Fundamentals, alongside less structured sources, e.g., Broker Research and News.

Industry: Tax & Accounting
Description: Here, the two biggest datasets are highly structured tax returns and tax regulations.

Industry: Legal
Description: Our legal content has a US bias and is mostly unstructured or semi-structured. It ranges from regulations to dockets, and from verdicts to case decisions from the Supreme Court, alongside numerous analytical works.

Industry: Reuters News
Description: Reuters delivers more than 2 million news articles and 0.5 million pictures every year. The news articles are unstructured but augmented with certain types of metadata.

In order to acquire the necessary data in the above-mentioned domains, we adopt a mixture of different approaches, including manual data entry, web scraping, feed consumption, bulk upload and OCR. The acquired data is further curated at different levels according to the product requirements and the desired quality level. Data curation may be done manually or automatically.

Although our acquired data contains a certain amount of structured data (e.g., database records, RDF triples, CSV files, etc.), the majority of our data is unstructured (e.g., Reuters news articles). Such unstructured data contains rich information that could be used to supplement existing structured data. Because our data comes from diverse sources and covers various domains, including Finance, Legal, Intellectual Property, Tax & Accounting, etc., it is very likely that the same entity (e.g., organization, location, judge, attorney and law firm) could occur in multiple sources with complementary information. For example, “Company A” may exist in our legal data and is related to all its legal cases; at the same time, this company may also appear in our financial data with all its Merger & Acquisition activities. Therefore, being able to interlink the different occurrences of the same entity across a variety of data sources is critical for providing our users a comprehensive view of the entities they are interested in. An additional requirement is that we keep the graph up to date with the fast changing nature of much of our source content.

In order to be able to mine information from unstructured data and to interlink entities across diverse data sources, we have devoted a significant amount of effort to developing tools and capabilities for automatic information extraction and data interlinking. For structured data, we link each entity in the data to the relevant nodes in our graph and update the information of the nodes being linked to. For unstructured data, we first perform information extraction to extract the entities and their relationships with other entities; such extracted structured data is then integrated into our knowledge graph.

3.2 Named Entity Recognition and Relation Extraction

Named Entity Recognition. Given a free text document, we first perform named entity recognition (NER) on the document to extract various types of entities, including companies, people, locations, events, etc. We accomplish this NER process by adopting a set of in-house natural language processing techniques that include both rule-based and machine learning algorithms. The rule-based solution uses well-crafted patterns and lexicons to identify both familiar and unfamiliar entity names.

Our machine learning-based NER consists of two parts, both of which are based on binary classification and evolved from the Closed Set Extraction (CSE) system. CSE originally solved a simpler version of the NER problem: extracting only known entities, without discovering unfamiliar ones. This simplification allows it to take a different algorithmic approach, instead of looking at the sequence of words. First, it searches the text for known entity aliases, which become entity candidates. Then it uses a binary classification task to decide whether each candidate actually refers to an entity or not, based on its context and on the candidate alias. The second component looks for unfamiliar entity names by creating candidates from patterns instead of from lexicons.

Both components use logistic regression for the classification problem, using LIBLINEAR’s implementation [2]. We employ commonly adopted features for our machine learning-based NER algorithm [3]: parts of speech, surrounding words, various lexicons and gazetteers (company


names, people names, geographies & locations, company suffixes, etc.). We also designed special features in order to deal with specific sources of interest; such special features are aimed at detecting source-specific patterns.
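
The alias-lookup step of the CSE-style component described above can be sketched as follows; the alias lexicon, entity IDs and sample sentence are hypothetical, and the binary classification step that would follow (logistic regression in our system) is only indicated in a comment:

```python
# Hypothetical alias lexicon: alias string -> known entity ID.
alias_lexicon = {"Denso Corp": "C-1", "Honda": "C-2", "Apple": "C-3"}

def find_candidates(text):
    """Return (alias, entity_id, offset) for each known alias found in text.

    A binary classifier would then accept or reject each candidate
    based on its context and on the candidate alias itself.
    """
    candidates = []
    for alias, eid in alias_lexicon.items():
        start = text.find(alias)
        if start != -1:
            candidates.append((alias, eid, start))
    return sorted(candidates, key=lambda c: c[2])  # order by position

text = "Honda named Denso Corp as a key supplier."
print(find_candidates(text))
```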

Relation Extraction. The core of this approach is a machine learning classifier that predicts the probability of a possible relationship for a given pair of identified entities in a given sentence. This classifier uses a set of patterns to exclude noisy sentences, and then extracts a set of features from each sentence. We employ context-based features, such as token-level n-grams and patterns, similar to those described in [4]. Other features are based on various transformations and normalizations that are applied to each sentence (such as replacing identified entities by their type, omitting irrelevant sentence parts, etc.). In addition, the classifier also relies on information available from our existing knowledge graph. For instance, when trying to identify the relationship between two identified companies, the industry information (i.e., healthcare, finance, automobile, etc.) of each company is retrieved from the knowledge graph and used as a feature. We also use past data to automatically detect labeling errors in our training set, which improves our classifier over time.

The algorithm is precision oriented in order to avoid introducing too many false positives into our knowledge graph. Currently, our relation extraction is only applied to the recognized entity pairs in each document, i.e., we do not try to relate two entities from two different free text documents. The relation extraction process runs as a daily routine on live document feeds. For each pair of entities, our system may extract multiple relationships; only those relationships with a confidence score above a pre-defined threshold are then added to our knowledge graph. Our named entity recognition and relation extraction APIs, also known as Intelligent Tagging, are publicly available5.
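
The precision-oriented confidence gate can be sketched in a few lines; the threshold value and the extracted relation tuples below are illustrative, not our production values:

```python
# Keep only extracted relations whose confidence clears the threshold;
# the tuples and the 0.9 threshold are illustrative values.
THRESHOLD = 0.9

extracted = [
    ("Denso Corp", "supplier", "Honda", 0.95),
    ("Denso Corp", "competitor", "Honda", 0.40),
]

accepted = [rel for rel in extracted if rel[3] >= THRESHOLD]
print(accepted)  # only the high-confidence relation enters the graph
```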

3.3 Entity Linking

Being able to mine information from unstructured data is important; yet, it is equally important to be able to integrate such mined information with existing structured data in order to provide our users with comprehensive information about the entities. We have developed several tools to link entities to nodes in our knowledge graph, primarily based on matching the attribute values of the nodes in the graph against those of a new entity. These tools adopt a generic but customizable algorithm, and thus can be adjusted for different specific use cases. In general, given an entity, we first adopt a blocking technique in order to find candidate nodes that the given entity could possibly be linked to. Blocking can be treated as a filtering process and is used to identify nodes that are promising candidates for linking in a lightweight manner [5]. The actual, expensive entity matching algorithms are then only applied between the given entity and the resulting candidate nodes.

Next, we compute a similarity score between each of the candidate nodes and the given entity using an SVM classifier that is trained with our surrogate learning technique [6]. Surrogate learning allows the automatic generation of

5. http://www.opencalais.com/opencalais-api/

training data from the datasets being matched. In surrogate learning, we find a feature that is class-conditionally independent of the other features and whose high values correlate with true positives and low values correlate with true negatives. Then, this surrogate feature is used to automatically label training examples in order to avoid manually labeling a large amount of training data.

An example of a surrogate feature is the reciprocal of the block size: 1/block_size. In this case, for a block containing just one candidate that is most likely a match (true positive), the value of this surrogate feature will be 1.0; while for a big block containing one matching entity and many non-matching entities (true negatives), the value of the surrogate feature will be small. Therefore, on average, a high value of this surrogate feature (close to 1.0) will correlate with true positives and a low value (much smaller than 1.0) will correlate with true negatives.
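
The arithmetic of the surrogate feature can be checked directly:

```python
# Surrogate feature: the reciprocal of the blocking-block size.
# Close to 1.0 for singleton blocks (likely true positives); small for
# large blocks whose members are mostly non-matches (true negatives).
def surrogate_feature(block_size):
    return 1.0 / block_size

print(surrogate_feature(1))   # singleton block -> 1.0, likely a match
print(surrogate_feature(50))  # large block -> 0.02, mostly non-matches
```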

The features needed for the SVM model are extracted from all pairs of comparable attributes between the given entity and a candidate node. For example, the attributes “first name” and “given name” are comparable. Based upon such calculated similarity scores, the given entity is linked to the candidate node with which it has the highest similarity score, provided their similarity score is also above a pre-defined threshold. The blocking phase is tuned towards high recall, i.e., we want to make sure that the blocking step will be able to cover the node in the graph that a given entity should be linked to, if such a node exists. Then, the actual entity linking step ensures that we only generate a link when there is sufficient evidence, in order to achieve decent precision, i.e., the similarity between the given entity and a candidate node is above a threshold [7]. Our entity linking system varies in the way it implements each of the two steps. For example, it may be configured to use different attributes and their combinations for blocking; it also provides different similarity algorithms that can be used to compute feature values. Our entity linking APIs are publicly available6.
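
A hedged sketch of this linking decision, with a simple string-similarity stand-in (difflib) in place of the trained SVM, and invented attribute names and a made-up 0.8 threshold:

```python
# Score blocking candidates from comparable attribute pairs, then link
# only when the best score clears a threshold. difflib stands in for
# the SVM; attribute names, IDs and the threshold are illustrative.
import difflib

COMPARABLE = [("first name", "given name"), ("last name", "family name")]

def similarity(entity, candidate):
    """Average string similarity over the comparable attribute pairs."""
    scores = [
        difflib.SequenceMatcher(None, entity[a], candidate[b]).ratio()
        for a, b in COMPARABLE
    ]
    return sum(scores) / len(scores)

entity = {"first name": "Timothy", "last name": "Cook"}
candidates = [
    {"given name": "Timothy", "family name": "Cook", "id": "P-100"},
    {"given name": "Tom", "family name": "Cooke", "id": "P-200"},
]

best = max(candidates, key=lambda c: similarity(entity, c))
# Link only when there is sufficient evidence (score above threshold).
link = best["id"] if similarity(entity, best) >= 0.8 else None
print(link)
```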

Figure 2 demonstrates an example of our NER, entity linking, and relation extraction process. First, with our NER techniques, two companies, “Denso Corp” and “Honda”, are identified; each of them is also given a temporary ID. Next, both of the recognized companies are linked to nodes in our knowledge graph and each of them is now associated with the corresponding Knowledge Graph ID (KGID). Furthermore, a relationship “supplier” (i.e., “Denso Corp” and “Honda” have a supply chain relationship between them) is extracted between them. Finally, the newly extracted relationship is added to our knowledge graph, since the score of this relationship (0.95) is above the pre-defined threshold.

4 DATA MODELING AND PHYSICAL STORAGE

There are a variety of mechanisms for representing the data, including the Entity-Relation (ER) model (i.e., for relational databases), plain text files (e.g., in tabular formats, such as CSV), and inverted indices (in order to facilitate efficient retrieval by using keyword queries). Plain text files are the easiest way to store the data. However, simply putting the data into files would not allow the users to conveniently

6. https://permid.org/match


Fig. 2. An Example of the Named Entity Recognition, Entity Linking, Relation Extraction and Knowledge Graph Update Process

obtain the information they are looking for from a massive number of files. Although relational databases are a mature technology and users can retrieve information by using expressive SQL queries, a schema (i.e., the ER model) has to be defined ahead of time in order to represent, store and query the data. This modeling process can be rather complicated and time-consuming, particularly for companies that have diverse types of datasets from various data sources. Furthermore, when new data keeps coming in, it may be necessary to keep revising the model and even re-modeling the data, which could be expensive in terms of both time and human effort. Data can also be used to build inverted indices for efficient retrieval. However, the biggest drawback of inverted indices is that users can only search for information with simple keyword queries, while in real-world scenarios many user search needs would be better captured by adopting more expressive query languages.

4.1 Modeling Data as RDF

One emerging data representation technique is the Resource Description Framework (RDF). RDF is a graph-based data model for describing entities and their relationships on the Web. Although RDF is commonly described as a directed and labeled graph, many researchers prefer to think of it as a set of triples, each consisting of a subject, predicate and object in the form of <subject, predicate, object>.

Triples are stored in a triple store and are queried with the SPARQL query language. Compared to both inverted indices and plain text files, triple stores and the SPARQL query language enable users to search for information with expressive queries in order to satisfy complex user needs. Although a model is required for representing data in triples (similar to relational databases), RDF enables the expression of rich semantics and supports knowledge inference.

Another big advantage of adopting an RDF model is that it enables easier data deletion and update. Traditional data storage systems are “schema on write”, i.e., the structure of the data (the data model) is decided at design time and any data that does not fit this structure is lost when ingesting the data. In contrast, “schema on read” systems attempt to capture everything and then apply computation horsepower to enforce a schema when the data is retrieved. An example would be the Elastic/Logstash/Kibana stack7

that does not enforce any schema when indexing the data but then tries to interpret one from the built indices. The

7. https://www.elastic.co/products

tradeoff is future-proofing and nimbleness at the expense of (rapidly diminishing) computing and storage. RDF sits at a unique intersection of the two types of systems. First of all, it is “schema on write” in the sense that there is a valid format for data to be expressed as triples. On the other hand, the boundless nature of triples means that statements can be easily added, deleted and updated by the system, and such operations are hidden from users. Therefore, adopting an RDF model for data representation fits our needs well.

While building our knowledge graph, we have designed an RDF model for our data. Our model contains classes (e.g., organizations and people) and predicates (the relationships between classes, e.g., “works for” and “is a board member of”). For brevity, we only show a snippet of our entire model in Figure 3. Here, the major classes include Organization, Legal Case, Patent and Country. Various relationships also exist between these classes: “involved in” connects a legal case and an organization, “presided over by” exists between a judge and a legal case, patents can be “granted to” organizations, and an organization can “develop” a drug which “is treatment for” one or more diseases. Over time, the model will evolve to accommodate new domains.

Fig. 3. Ontology Snippet of Thomson Reuters’s Knowledge Graph
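The model snippet can be pictured as plain subject-predicate-object triples. The following toy sketch uses made-up entity identifiers; the predicate names simply echo the relationships listed above.

```python
# Triples as (subject, predicate, object) tuples, mirroring the model snippet:
# legal cases involve organizations, patents are granted to them, and
# organizations develop drugs that treat diseases. IDs are illustrative.
triples = [
    ("case:123", "involved_in", "org:AcmeCorp"),
    ("patent:987", "granted_to", "org:AcmeCorp"),
    ("org:AcmeCorp", "develops", "drug:X1"),
    ("drug:X1", "is_treatment_for", "disease:Pain"),
]

def objects(triples, subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

drugs = objects(triples, "org:AcmeCorp", "develops")
```

In a real triple store the same lookup would be a one-pattern SPARQL query; the tuple view is only meant to make the graph-as-set-of-triples reading concrete.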

4.2 Data Storage

In our current implementation, we store the triples in two ways. We index the triples on their subject, predicate and object respectively with the Elastic search engine. We also build a full-text search index on objects that are literal values, where such literal values are tokenized and treated as terms in the index. This enables fast retrieval of the data with simple keyword queries. Additionally, we store all the triples in a triple store in order to support search with complex SPARQL queries. Currently, our TR knowledge graph manages about 5 billion triples; however, this only represents a small percentage of our data and the number of triples is expected to grow rapidly over time.
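The dual storage scheme just described can be sketched as follows. This is an assumption-laden toy: the list stands in for the triple store, the dictionary for the Elastic full-text index, and the `kg:` prefix convention for distinguishing literals from resource IDs is invented for the example.

```python
from collections import defaultdict

# Every triple goes to the "triple store" (here just a list); literal objects
# are additionally tokenized into a full-text inverted index for keyword lookup.
triple_store = []
inverted_index = defaultdict(set)  # term -> set of subjects

def ingest(s, p, o):
    triple_store.append((s, p, o))
    if not o.startswith("kg:"):          # treat non-resource objects as literals
        for term in o.lower().split():
            inverted_index[term].add(s)

ingest("kg:msft", "has_name", "Microsoft Corporation")
ingest("kg:msft", "headquartered_in", "kg:usa")
ingest("kg:honda", "has_name", "Honda Motor Co")

hits = sorted(inverted_index["microsoft"])
```

A keyword query hits the inverted index and returns subjects directly, while an expressive graph query would be answered from the triple store.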

In addition to the three basic elements in a triple (i.e., subject, predicate and object), a fourth element can also be added, turning a triple into a quad8. This fourth element is

8. https://www.w3.org/TR/n-quads/


generally used to provide provenance information of the triple, such as its source [8] and trustworthiness [9]. Such provenance information can be used to evaluate the quality of a triple. For example, if a triple comes from a reputable source, then it may generally have a higher quality level. In our current system, we use the fourth element to track the source and usage information of the triples. The following examples show the usage of this fourth element:

• <Microsoft, has address, Address1, Wikipedia>, indicating that this triple comes from Wikipedia

• <Jim Hendler, works for, RPI, 2007 to present>, showing the time period during which Jim Hendler works for RPI.
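Selecting statements by the fourth element is then straightforward. A small sketch, reusing the quad values from the examples above (the third quad and the helper name are invented for illustration):

```python
# Quads: (subject, predicate, object, context), where the context element
# carries provenance or usage information, as in the examples above.
quads = [
    ("Microsoft", "has_address", "Address1", "Wikipedia"),
    ("Jim Hendler", "works_for", "RPI", "2007 to present"),
    ("Microsoft", "founded_in", "1975", "Wikipedia"),   # illustrative extra fact
]

def from_source(quads, source):
    """Select the triples whose fourth element matches `source`."""
    return [(s, p, o) for s, p, o, ctx in quads if ctx == source]

wiki_facts = from_source(quads, "Wikipedia")
```

This kind of filter is what makes per-source quality assessment (e.g., trusting reputable sources more) easy to implement on top of quads.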

5 QUERYING THE KNOWLEDGE GRAPH WITH NATURAL LANGUAGE

In previous sections, we have presented a Big Data framework and infrastructure for building an enterprise knowledge graph. However, given the built graph, one important question is how to enable end users to retrieve the data from this graph in an intuitive and convenient manner. Technical professionals, such as database experts and data scientists, may simply employ SPARQL queries to access this information. But non-technical information professionals, such as journalists, financial analysts and patent lawyers, who cannot be expected to learn such specialized query languages, still need a fast and effective means for accessing the data that is relevant to the task at hand.

Keyword-based queries have been frequently adopted to allow non-technical users to access large-scale RDF data [10], and can be applied in a uniform fashion to information sources that may have wildly divergent logical and physical structure. But they do not always allow precise specification of the user’s intent, so the returned result sets may be unmanageably large and of limited relevance. At the same time, it would be really difficult for non-technical users to learn specialized query languages (e.g., SPARQL) and to keep up with the pace of the development of new query languages.

In order to enable non-technical users to intuitively find the exact information they are seeking, we developed TR Discover, a natural language interface that is designed to bridge the gap between keyword-based search and structured query. In our system, the user creates natural language questions, which are mapped into a logic-based intermediate language. A grammar defines the options available to the user and implements the mapping from English into logic. An auto-suggest mechanism guides the user towards questions that are both logically well-formed and likely to elicit useful answers from a knowledge base. A second translation step then maps from the logic-based representation into a standard query language (SPARQL in this paper), allowing the translated query to rely on robust existing technology. Since all professionals can use natural language, we retain the accessibility advantages of keyword search, and since the mapping from the logical formalism to the query language is information-preserving, we retain the precision of query-based information access. We present the details of TR Discover in the rest of this section.

5.1 Question Understanding

We use a feature-based context-free grammar (FCFG) for parsing natural language questions. Our FCFG consists of phrase structure rules (i.e., grammar rules) on non-terminal nodes and lexical entries (i.e., lexicon) for leaf nodes. The large majority of the phrase structure rules are domain independent, allowing the grammar to be portable to new domains. The following shows a few examples of our grammar rules: G1 - G3. Specifically, Rule G3 indicates that a verb phrase (VP) contains a verb (V) and a noun phrase (NP).

G1: NP → N
G2: NP → NP VP
G3: VP → V NP

Furthermore, as for the lexicon, each entry in the FCFG lexicon contains a variety of domain-specific features that are used to constrain the number of parses computed by the parser, preferably to a single, unambiguous parse. L1-L3 are examples of lexical entries.

L1: N[TYPE=drug, NUM=pl, SEM=<λx.drug(x)>] → ‘drugs’
L2: V[TYPE=[drug,org,dev], SEM=<λX x.X(λy.dev_org_drug(y,x))>, TNS=past, NUM=?n] → ‘developed by’
L3: V[TYPE=[org,country,hq], NUM=?n] → ‘headquartered in’

Here, L1 is the lexical entry for the word drugs, indicating that it is of TYPE drug, is plural (“NUM=pl”), and has the semantic representation λx.drug(x). Verbs (V) have an additional tense feature (TNS), as shown in L2. The TYPE of a verb specifies both the potential subject-TYPE and object-TYPE. With such type constraints, we can then license the question drugs developed by Merck while rejecting nonsensical questions like drugs headquartered in the U.S. on the basis of the mismatch in semantic type. A general form for specifying the subject and object types for verbs is as follows: TYPE=[subject constraint, object constraint, predicate name].

Disambiguation relies on the unification of features on non-terminal syntactic nodes. We mark prepositional phrases (PPs) with features that determine their attachment preference. For example, we specify that the prepositional phrase for pain must attach to an NP rather than a VP; thus, in the question Which companies develop drugs for pain?, “for pain” cannot attach to “develop” but must attach to “drugs”. Additional features constrain the TYPE of the nominal head of the PP and the semantic relationship that the PP must have with the phrase to which it attaches. This approach filters out many of the syntactically possible but undesirable PP-attachments in long queries with multiple modifiers, such as companies headquartered in Germany developing drugs for pain or cancer. When a natural language question has multiple parses, we always choose the first parse. Future work may include developing ranking mechanisms in order to rank the parses of a question.
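The semantic type check that licenses drugs developed by Merck but rejects drugs headquartered in the U.S. can be illustrated in isolation. This is a toy re-implementation of the TYPE=[subject, object, predicate] convention, not the actual FCFG unification machinery; the two small dictionaries are invented stand-ins for the lexicon.

```python
# Toy version of the verbs' TYPE=[subject_type, object_type, predicate] check.
VERB_TYPES = {
    "developed by": ("drug", "org", "dev_org_drug"),
    "headquartered in": ("org", "country", "org_country_hq"),
}
NOUN_TYPES = {"drugs": "drug", "Merck": "org", "the U.S.": "country"}

def licensed(subject, verb, obj):
    """A question is licensed only if subject/object types unify with the verb."""
    subj_t, obj_t, _predicate = VERB_TYPES[verb]
    return NOUN_TYPES[subject] == subj_t and NOUN_TYPES[obj] == obj_t

ok = licensed("drugs", "developed by", "Merck")          # types unify
bad = licensed("drugs", "headquartered in", "the U.S.")  # subject type mismatch
```

In the real grammar the same effect falls out of feature unification during parsing, rather than an explicit lookup like this.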

The outcome of our question understanding process is a logical representation of the given natural language question. Such a logical representation is then further translated (to be introduced in Section 5.3) into an executable query (SPARQL in this paper) for retrieving the query results. Adopting such an intermediate logical representation gives us the flexibility to further translate the logical representation into different types of executable queries in order to support different types of data stores (e.g., relational database, triple store, inverted index, etc.).


5.2 Enabling Question Completion with Auto-suggest

Traditional question answering systems often require users to enter a complete question. However, it may be difficult for novice users to do so, e.g., due to the lack of familiarity and an incomplete understanding of the underlying data. One unique feature of TR Discover is that it provides suggestions in order to help users to complete their questions. The intuition here is that our auto-suggest module guides users in exploring the underlying data and completing a question that can be potentially answered with the data. Unlike Google’s query auto-completion that is based on query logs [11], our suggestions are computed based upon the relationships and entities in our knowledge graph and by utilizing the linguistic constraints encoded in our grammar.

Our auto-suggest module is based on the idea of left-corner parsing. Given a query segment qs (e.g., drugs, developed by, etc.), we find all grammar rules whose left corner fe on the right side matches the left side of the lexical entry of qs. We then find all leaf nodes in the grammar that can be reached by using the adjacent element of fe. For all reachable leaf nodes (i.e., lexical entries in our grammar), if a lexical entry also satisfies all the linguistic constraints, we then treat it as a valid suggestion.
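The left-corner idea can be sketched with a tiny grammar. This is a much-simplified suggester: the rules, lexicon and type annotations below are illustrative toys (not the production FCFG), and the only linguistic constraint enforced is that a suggested verb's subject type must match the current segment.

```python
# Grammar: NP -> N | NP VP ; VP -> V NP. Lexicon entries carry a category and
# a semantic type (verbs: (subject_type, object_type)). All values are toys.
RULES = {"NP": [["N"], ["NP", "VP"]], "VP": [["V", "NP"]]}
LEXICON = [
    ("drugs", "N", "drug"),
    ("companies", "N", "org"),
    ("Pfizer Inc", "N", "org"),
    ("developed by", "V", ("drug", "org")),
    ("headquartered in", "V", ("org", "country")),
]

def parents_of(cat):
    """Yield (lhs, adjacent element) for rules where `cat` is the left corner."""
    for lhs, alternatives in RULES.items():
        for rhs in alternatives:
            if rhs[0] == cat:
                yield lhs, rhs[1] if len(rhs) > 1 else None

def leaf_cats(cat, seen=None):
    """Lexical (terminal) categories reachable as the left corner of `cat`."""
    seen = set() if seen is None else seen
    if cat in seen:
        return set()
    seen.add(cat)
    if cat not in RULES:
        return {cat}
    leaves = set()
    for rhs in RULES[cat]:
        leaves |= leaf_cats(rhs[0], seen)
    return leaves

def suggest(segment):
    lex = {w: (c, t) for w, c, t in LEXICON}
    cat, seg_type = lex[segment]
    frontier, adjacent, seen = {cat}, set(), {cat}
    while frontier:                       # climb unary left-corner chains
        nxt = set()
        for c in frontier:
            for lhs, adj in parents_of(c):
                if adj is not None:
                    adjacent.add(adj)     # e.g., NP -> NP VP yields VP
                elif lhs not in seen:
                    nxt.add(lhs)          # e.g., NP -> N: keep climbing
        seen |= nxt
        frontier = nxt
    reachable = set()
    for a in adjacent:
        reachable |= leaf_cats(a)
    # Constraint check: a suggested verb's subject type must match the segment.
    return [w for w, c, t in LEXICON
            if c in reachable and (c != "V" or t[0] == seg_type)]
```

With this toy grammar, typing drugs yields developed by (a drug-subject verb), while companies yields headquartered in.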

There are (at least) two ways of using the auto-suggest facility. On one hand, users may be interested in broad, exploratory questions; however, due to lack of familiarity with the data, guidance from our auto-suggest module will be needed to help this user build a valid question in order to explore the underlying data. In this situation, users can work in steps: they could type in an initial question segment and wait for the system to provide suggestions. Then, users can select one of the suggestions to move forward. By repeating this process, users can build well-formed natural language questions (i.e., questions that are likely to be understood by our system) in a series of small steps guided by our auto-suggest. Figures 4(a) to 4(c) demonstrate this question building process. Assuming that User A starts by typing in dr, drugs will then appear as a possible completion. User A can either continue typing drugs or select it from the drop down list. Upon selection, suggested continuations to the current question segment, such as using and developed by, are then provided to User A. Suppose our user is interested in exploring drug manufacturers and thus selects developed by. In this case, both the generic type, companies, along with specific company instances like Pfizer Inc and Merck & Co Inc are offered as suggestions. User A can then select Pfizer Inc to build the valid question, drugs developed by Pfizer Inc, thereby retrieving answers from our knowledge graph.

Alternatively, users can type in a longer string, without pausing, and our system will chunk the question and try to provide suggestions for users to further complete their question. For instance, given the partial question cases filed by Microsoft tried in ..., our system first tokenizes this question; then, starting from the first token, it finds the shortest phrase (a series of continuous tokens) that matches a suggestion and treats this phrase as a question segment. In this example, cases (i.e., legal cases) will be the first segment. As the question generation proceeds, our system finds suggestions based on the discovered question segments, and produces the following sequence of segments: cases, filed by,

(a) “dr” is typed (b) “drugs” is selected and suggestions are provided

(c) “developed by” is picked and “Pfizer Inc” can be chosen to complete a question

Fig. 4. An Example of Auto-suggest in TR Discover

Microsoft, and tried in. At the end, the system knows that tried in is likely to be followed by a phrase describing a jurisdiction, and is able to offer corresponding suggestions to the user. In general, an experienced user might simply type in cases filed by Microsoft tried in; while first-time users who are less familiar with the data can begin with the stepwise approach, progressing to a more fluent user experience as they gain a deeper understanding of the underlying data.

We rank the suggestions based upon statistics extracted from our knowledge graph. Each node in our knowledge graph corresponds to a lexical entry (i.e., a potential suggestion) in our grammar (i.e., FCFG), including entities (e.g., specific drugs, drug targets, diseases, companies, and patents), predicates (e.g., developed by and filed by), and generic types (e.g., Drug, Company, Technology, etc.). Using our knowledge graph, the ranking score of a suggestion is defined as the number of relationships it is involved in. For example, if a company filed 10 patents and is also involved in 20 lawsuits, then its ranking score will be 30. Our current ranking is computed only based upon the data; in future work, we plan to explore how to tune the system’s behavior to a particular individual user by mining our query logs for similar queries previously made by that user.
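The ranking score just defined is simply a node's degree in the graph. A minimal sketch, reproducing the 10-patents-plus-20-lawsuits example with invented triples:

```python
from collections import Counter

# A node's ranking score is the number of relationships (edges) it appears in.
# The triples below are illustrative: org:Acme filed 10 patents and is
# involved in 20 lawsuits, so its score should be 30.
triples = (
    [("org:Acme", "filed", f"patent:{i}") for i in range(10)]
    + [("org:Acme", "involved_in", f"case:{i}") for i in range(20)]
    + [("org:Other", "filed", "patent:x")]
)

def ranking_scores(triples):
    degree = Counter()
    for s, _p, o in triples:
        degree[s] += 1   # each triple counts for both of its endpoints
        degree[o] += 1
    return degree

scores = ranking_scores(triples)
```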

5.3 Question Translation and Execution

In contrast to other natural language interfaces [12], our question understanding module first maps a natural language question to its logical representation (Section 5.1); in this paper, we adopt First Order Logic (FOL). The FOL representation of a natural language question is further translated to an executable query. This intermediate logical representation provides us the flexibility to develop different query translators for various types of data stores.

There are two steps in translating an FOL representation to an executable query. In the first step, we parse the FOL


Fig. 5. The Parse Tree for the FOL of the Question “Drugs developed by Merck”

representation into a parse tree by using an FOL parser. This FOL parser is implemented with ANTLR [13] (a parser development tool). The FOL parser takes a grammar and an FOL representation as input, and generates a parse tree for the FOL representation. Figure 5 shows the parse tree of the FOL for the question “Drugs developed by Merck”.

We then perform an in-order traversal (with ANTLR’s APIs) of the FOL parse tree and translate it to an executable query. While traversing the tree, we put all the atomic query constraints (e.g., “type(entity0, company)”, indicating that “entity0” represents a company entity, and “pid(entity0, 4295904886)”, showing the internal ID of the entity represented by “entity0”) and the logical connectors (i.e., “and” and “or”) into a stack. When we finish traversing the entire tree, we pop the conditions out of the stack to build the correct query constraints; predicates (e.g., “develop_org_drug” and “pid”) in the FOL are also mapped to their corresponding predicates in our RDF model (Section 4.1) in order to formulate the final SPARQL query. We run the translated SPARQL queries against an instance of the free version of GraphDB [14], a state-of-the-art triple store for storing triple data and for executing SPARQL queries.

As a concrete example, the following summarizes the translation from a natural language question to a SPARQL query via an FOL representation:

Natural Language Question: Drugs developed by Merck

FOL: all x.(drug(x) → (develop_org_drug(entity0,x) & type(entity0,Company) & pid(entity0,4295904886)))

SPARQL Query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX example: <http://www.example.com#>
select ?x
where {
  ?x rdf:type example:Drug .
  example:4295904886 example:develops ?x .
}
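The translation just shown can be approximated in a few lines. This sketch deliberately skips the ANTLR parse tree and stack: the FOL atoms arrive as a flat list, and the predicate mapping, prefixes and atom encoding are all illustrative assumptions rather than the production translator.

```python
# Simplified FOL-to-SPARQL step. Each atom is (predicate, (arg1, arg2));
# pid(...) atoms bind a variable to a concrete entity ID, and the predicate
# map mirrors the RDF-model mapping described in the text (values are toys).
PREDICATE_MAP = {"develop_org_drug": "example:develops"}

def translate(atoms, select_var="x"):
    bindings = {var: pid for pred, (var, pid) in atoms if pred == "pid"}
    patterns = []
    for pred, (a, b) in atoms:
        if pred == "pid" or (pred == "type" and a in bindings):
            continue  # folded into the concrete entity URI below
        if pred == "type":
            patterns.append(f"?{a} rdf:type example:{b} .")
        else:
            subj = f"example:{bindings[a]}" if a in bindings else f"?{a}"
            patterns.append(f"{subj} {PREDICATE_MAP[pred]} ?{b} .")
    body = "\n".join("  " + p for p in patterns)
    return f"select ?{select_var}\nwhere {{\n{body}\n}}"

# "Drugs developed by Merck": drug(x), develop_org_drug(entity0, x),
# type(entity0, Company), pid(entity0, 4295904886)
atoms = [
    ("type", ("x", "Drug")),
    ("develop_org_drug", ("entity0", "x")),
    ("type", ("entity0", "Company")),
    ("pid", ("entity0", "4295904886")),
]
query = translate(atoms)
```

The output reproduces the two triple patterns of the SPARQL query above; prefixes and connector handling (“and”/“or”) are omitted for brevity.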

6 EVALUATION

6.1 Evaluation of Data Transformation and Interlinking

Here, we evaluate our named entity recognition, relation extraction, and entity linking services, i.e., Intelligent Tagging.

Dataset. Named entity recognition is evaluated separately for Company, Person, City and Country; entity linking is evaluated on Company and Person entities. Table 2

TABLE 2
Statistics of NER and Entity Linking Evaluation Datasets

Task                Entity Type   |Document|   |Mention|
Entity Recognition  Company       1,496        4,450
                    Person        600          787
                    City          100          101
                    Country       2,000        1,835
Entity Linking      Company       1,000        673
                    Person        100          156

shows the statistics of our evaluation datasets for NER and entity linking. All documents were randomly sampled from a large news corpus. For NER, each selected document was annotated by one Thomson Reuters employee. It should be noted that these entity mention counts are at the document level, and not the instance level. For example, if a company appeared in 3 different documents and 5 times in each, we count it as 3 company mentions (the instance level count would have been 15, and the unique companies count would have been 1). For entity linking, the randomly selected entities are manually resolved to entities in our knowledge graph.

We also evaluate our machine learning-based relation extraction algorithm. We present the results on two different types of relations: “Supply Chain” and “Merger & Acquisition”. To evaluate the supply chain relation, we first identified 20,000 possible supply chain relationships (from 19,334 documents). We then sent these 20,000 possible relations to Amazon Mechanical Turk for manual annotation. Each task was sent to two different workers; in case of disagreement between the first two workers, a possible relation was then sent to a third worker in order to get a majority decision. The agreement rate between workers was 84%. Through this crowdsourcing process, we obtained 7,602 “supply-chain” relations as reported by the workers. We then checked the quality of a random sample of these relations and found the reported relations of high quality, so we used all the 7,602 relations as groundtruth for our evaluation.

To evaluate the Merger & Acquisition (M&A) relation, we first identified 2,590 possible M&A relations (from 2,500 documents). These possible relations were then manually tagged by Thomson Reuters employees: each document was annotated by one annotator. The quality of the tagged set was further assessed by another employee by examining


TABLE 3
Named Entity Recognition, Relation Extraction and Entity Linking Results

Task                 Entity/Relation Type   Precision   Recall   F1
Entity Recognition   Company                0.94        0.75     0.83
                     Person                 0.91        0.87     0.89
                     City                   0.93        0.80     0.86
                     Country                0.95        0.89     0.92
Relation Extraction  Supply Chain           0.76        0.46     0.57
                     Merger & Acquisition   0.71        0.51     0.59
Entity Linking       Company                0.92        0.89     0.90
                     Person                 0.92        0.73     0.81

(a) Named Entity Recognition (b) Entity Linking

Fig. 6. Runtime Evaluation.

randomly sampled annotations, and was found to be 92% accurate. The overall annotation process resulted in 603 true Merger & Acquisition relations, which were used as groundtruth for our evaluation.

Metrics. We use the standard evaluation metrics: Precision, Recall and F1-score, as defined in Equation 1:

P = |correctly detected entities| / |totally detected entities|

R = |correctly detected entities| / |groundtruth entities|

F1-score = 2 * P * R / (P + R)    (1)

The three metrics for relation extraction and entity linkingare defined in a similar manner by replacing “entities” with“relations” or “entity pairs” in the above three equations.
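The three metrics of Equation 1 can be computed directly over sets of detected and gold items; the sets in the example below are invented to illustrate the arithmetic.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over sets of detected items
    (entities, relations, or entity pairs), per Equation 1."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g., 3 of 4 detections are correct, out of 6 groundtruth entities:
p, r, f1 = prf1({"a", "b", "c", "d"}, {"a", "b", "c", "e", "f", "g"})
```

Here p = 3/4 and r = 3/6, giving an F1 of 0.6.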

Results. Table 3 demonstrates the results of our NER component on four different types of entities, the results of our relation extraction algorithm on two different relations, and our entity linking results on two different types of entities. We observe that the F1 score of our relation extraction algorithm is relatively low. Compared to the state-of-the-art algorithms, although evaluated on a different set of relationships, our algorithm actually achieves comparable performance (e.g., compared to an F1-score of 55.5% in the paper by Zhou et al. [4] and an F1-score of 51.9% from the system by Miwa and Bansal [15]). In future work, we plan to perform a more comprehensive evaluation on different types of relations in order to better compare to the state-of-the-art.

In addition, we report the runtime of our NER and entity linking components on two types of documents: Average and Large. “Average” refers to a set of 5,000 documents whose size is smaller than 15KB, with an average size of 2.99KB. “Large” refers to a collection of 1,500 documents whose size is bigger than 15KB but smaller than 500KB (the maximum document size in our data), with an average size of 63.64KB. Figure 6 shows the average runtime per document in each category by varying the number of threads.

Analysis. For NER, from Table 3, we can see that our system generally achieves good precision, while many recall mistakes are due to missing entity aliases in our lexicons. Such errors are easily fixable by lexicon enrichment. Other mistakes are due to insufficient context, especially in non-narrative text (e.g., tables). For relation extraction, many problems are actually due to propagated errors from our entity extractors. We also discovered that our system performs best on short sentences, where the two entities in question are close; when the sentence is long and we try to identify a relation between two entities that are in different parts of the sentence, our system is more likely to make an error. This is due to the fact that most of our features are local, focusing on the immediate context of the two entities. In future work, we plan to use parsing-based features, and especially look at the path between the two entities in the parse tree, which we think could be helpful.

6.2 Evaluation of Natural Language Querying

Dataset. We evaluate the runtime of the different components of TR Discover on a subset of our knowledge graph. Our evaluation dataset contains about 329 million entities and 2.2 billion triples. This dataset primarily covers the following domains: Intellectual Property, Life Science, Finance and Legal. The major entity types include Drug, Company, Technology, Patent, Country, Legal Case, Attorney, Law Firm, Judge, etc. Various types of relationships exist between the entities, including Develop (Company develops Drug), Headquartered in (Company headquartered in Country), Involved In (Company involved in Legal Case), Presiding Over (Legal Case presided over by Judge), etc.

Infrastructure. We adopt two machines for evaluation:

• Server-GraphDB: We host a free version of GraphDB, a triple store, on an Oracle Linux machine with two 2.8GHz CPUs (40 cores) and 256GB of RAM.

• Server-TRDiscover: We perform question understanding, auto-suggest, and FOL translation on a RedHat machine with a 16-core 2.90GHz CPU and 264GB of RAM.

We use a dedicated server for hosting the GraphDB store, so that the execution of the SPARQL queries is not interfered with by other processes. A natural language question is first sent from an ordinary laptop to Server-TRDiscover for parsing and translation. If both processes finish successfully, the translated SPARQL query is then sent to Server-GraphDB for execution. The results are then sent back to the laptop.

Random Question Generation. In order to evaluate the runtime of TR Discover, we randomly generated 10K natural language questions using our auto-suggest component (Section 5.2). We give the auto-suggest module a starting point, e.g., drugs or cases, and then perform a depth-first search to uncover all possible questions. At each depth, for each question segment, we select the b most highly ranked suggestions9; we then continue this search process with each of the b suggestions. By setting different depth limits, we generate questions with different levels of complexity (i.e., different numbers of verbs). Using this process, we generated

9. Choosing the most highly ranked suggestions helps increase the chance of generating questions that will result in non-empty result sets, in order to better measure the execution time of SPARQL queries.


2K natural language questions for each number of verbsfrom 1 to 5, thus 10K questions in total.
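The bounded depth-first generation procedure can be sketched as follows. The suggestion table below is a toy stand-in for the real auto-suggest module, and the verb-counting budget is a simplification of the depth limits described above.

```python
# Bounded DFS over the auto-suggest space: take the top-b suggestions at each
# step and limit question complexity by the number of verbs used so far.
SUGGEST = {
    "drugs": ["developed by", "using"],
    "developed by": ["Pfizer Inc", "companies"],
    "companies": ["headquartered in"],
    "headquartered in": ["Germany"],
}
VERBS = {"developed by", "using", "headquartered in"}

def generate(prefix, last, max_verbs, b=1, verbs=0):
    """Emit every question reachable within the verb budget."""
    suggestions = SUGGEST.get(last, [])[:b]   # top-b ranked continuations
    if not suggestions:
        return [prefix]                       # dead end: a complete question
    questions = []
    for nxt in suggestions:
        n = verbs + (nxt in VERBS)
        if n <= max_verbs:
            questions += generate(f"{prefix} {nxt}", nxt, max_verbs, b, n)
    return questions

one_verb = generate("drugs", "drugs", max_verbs=1)
two_verbs = generate("drugs", "drugs", max_verbs=2, b=2)
```

Raising `b` and the verb budget widens the search, which is how questions of increasing complexity (1 to 5 verbs) can be produced from the same starting segment.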

Among these 10K questions, we present the evaluation results on the valid questions. A question is considered valid if it successfully parses and its corresponding SPARQL query returns a non-empty result set. Our parser (Section 5.1) relies on a grammar (i.e., a set of rules) for question understanding; as the number of rules increases, it is possible that the parser may not be able to apply the right set of rules to understand a question, especially a complex one (e.g., with 5 verbs). Also, as we increase the number of verbs in a question (i.e., adding more query constraints to the final SPARQL query), it is more likely for a query to return an empty result set. In both cases, the runtime is faster than when successfully finishing the entire process with a non-empty result set. Thus, we only report the results on valid questions.

Runtime Results. Figure 7 shows the runtime of natural language parsing, FOL translation and SPARQL execution, respectively. According to Figure 7(a), unless a question becomes truly complicated (with 4 or 5 verbs), the parsing time is generally around or below 3 seconds. One example question with 5 verbs could be Patents granted to companies headquartered in Australia developing drugs targeting Lectin mannose binding protein modulator using Absorption enhancer transdermal. We believe that questions with more than 5 verbs are rare, thus we did not evaluate questions beyond this level of complexity. In our current implementation, we adopt NLTK¹⁰ for question parsing; however, we supply NLTK with our own FCFG grammar and lexicon.

From Figure 7(b), we can see that only a few milliseconds are needed for translating the FOL of a natural language question to a SPARQL query. In general, the translator only needs to traverse the FOL parse tree (Figure 5) and appropriately combine the different query constraints.
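As a rough illustration of this traversal, the toy translator below walks a nested-tuple stand-in for an FOL parse tree and emits one triple pattern per predicate, joining them under a conjunction; the predicate names are invented, not the paper's lexicon:

```python
# Toy FOL-to-SPARQL translation: internal "and" nodes join their children,
# leaf nodes (predicate, subject, object) become triple patterns.

def to_sparql(node):
    op = node[0]
    if op == "and":
        return "\n  ".join(to_sparql(child) for child in node[1:])
    pred, subj, obj = node  # leaf predicate
    return f"{subj} :{pred} {obj} ."

# Hypothetical parse tree for "companies headquartered in Australia
# developing drugs" (names illustrative only).
tree = ("and",
        ("type", "?c", ":Company"),
        ("develop", "?c", "?d"),
        ("headquartered_in", "?c", ":Australia"))

query = "SELECT ?c WHERE {\n  " + to_sparql(tree) + "\n}"
print(query)
```

Because each node is visited once, the translation step stays in the millisecond range, consistent with Figure 7(b).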

Finally, we demonstrate the execution time and the result set size of the translated SPARQL queries in Figure 7(c). For questions of all complexity levels, the average execution time is below 500 milliseconds, showing the potential of applying a triple store to real-world scenarios with a similar size of data. As we increase the number of verbs in a question, the runtime actually goes down, since GraphDB is able to utilize the relevant indices on the triples to quickly find potential matches. In addition, all of our 5-verb testing questions generate an empty result set; thus, for this complexity level, a question is considered valid as long as it successfully parses.

Although the effectiveness (i.e., precision and recall) of TR Discover is not discussed in this paper due to space limitations, it is studied in our previous conference paper [16].

6.3 Time Complexity Analysis

For our NLP modules, the complexity of entity extraction is O(n + k log k), where n is the length of the input document and k is the number of entity candidates in it (k ≪ n, except for some edge cases with a large number of candidates). The worst-case complexity of our relation extraction component is O(n + l²), where n is the length of the input document and l is the number of extracted entities, as we consider all pairs of entities in the candidate sentences. The complexity

10. http://www.nltk.org/

of linking a single entity is O(b·r²), where b is the block size (i.e., the number of linking candidates) and r is the number of attributes for a given entity.
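The O(b·r²) cost can be pictured as scoring each of the b candidates in a block by comparing all attribute pairs; the scorer below is a placeholder, since the paper does not specify the actual similarity function, and the block contents are invented:

```python
# Illustrative sketch of O(b * r^2) entity linking: each of the b
# candidates is scored with r^2 pairwise attribute comparisons.

def similarity(a, b):
    """Placeholder attribute similarity (exact match only)."""
    return 1.0 if a == b else 0.0

def score(mention_attrs, candidate_attrs):
    # r^2 comparisons: every mention attribute against every candidate attribute
    return sum(similarity(m, c) for m in mention_attrs for c in candidate_attrs)

def link(mention_attrs, block):
    """Return the best-scoring candidate among the b entities in the block."""
    return max(block, key=lambda cand: score(mention_attrs, block[cand]))

block = {"Acme Corp": ["Acme", "New York"], "Acme Ltd": ["Acme", "London"]}
print(link(["Acme", "New York"], block))
```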

For the natural language interface, the time complexity of parsing a natural language question to its First Order Logic (FOL) representation is O(n³), where n is the number of words in the question. We then parse the FOL into an FOL parse tree with time complexity O(n⁴). Next, the FOL parse tree is translated to a SPARQL query via an in-order traversal with O(n) complexity. Finally, the SPARQL query is executed against the triple store. The complexity here is largely dependent on the nature of the query itself (e.g., the number of joins) and the implementation of the SPARQL query engine.

7 RELATED WORK

Never-Ending Language Learning (NELL) [17] and Open Information Extraction (OpenIE) [18] are two efforts in extracting knowledge facts from a broad range of domains for building knowledge graphs. With the extracted knowledge facts, Pujara et al. [19] proposed an approach for noise removal and knowledge inference. In the Semantic Web community, DBpedia [20] and Wikidata [21] are two of the notable efforts in this area. The latest version of DBpedia has 4.58 million entities, including 1.5 million persons, 735K places and 241K organizations, among others. Wikidata covers a broad range of domains and currently has more than 17 million “data items” that include specific entities and concepts. Various efforts have also been devoted to creating knowledge graphs in multiple languages [22], [23].

Named Entity Recognition. Early attempts at entity recognition relied on linguistic rules and grammar-based techniques [24], [25]. More recently, most research has focused on the use of statistical models. A common approach is to use sequence labeling techniques, such as hidden Markov models [26], conditional random fields [27] and maximum entropy [28]. These methods rely on language-specific features, which aim to capture linguistic subtleties and to incorporate external knowledge bases [29]. With the advancement of deep learning techniques, there have been several successful attempts to design neural network architectures that solve the NER problem without the need to design and implement specific features [30], [31].

Relation Extraction. Similar to NER, this problem was initially approached with rule-based methods [25]. Later attempts combined statistical machine learning with various NLP techniques for relation extraction, such as syntactic parsing [32], [33] and chunking [4]. Recently, several neural network-based algorithms have been proposed for relation extraction [34], [35]. In addition, research has shown that the joint modeling of entity recognition and relation extraction can achieve better results than the traditional pipeline approach [36].

Entity Linking. Linking extracted entities to a reference set of named entities is another important task in building a knowledge graph. The foundation of statistical entity linking lies in the work of the U.S. Census Bureau on record linkage [37]. These techniques were generalized for performing entity linking tasks in various domains [38]. In recent years, special attention has been given to linking entities to Wikipedia by employing word disambiguation techniques


Fig. 7. Runtime Evaluation. (a) Question Understanding; (b) FOL to SPARQL Translation; (c) SPARQL Query Execution.

and relying on Wikipedia’s specific attributes [39], [40]. Such approaches have since been generalized for linking entities to other knowledge bases as well [41], [42].

Natural Language Interface (NLI). Keyword search [43] has been frequently adopted for retrieving information from knowledge bases. Although researchers have investigated how to best interpret the semantics of keyword queries [44], oftentimes users may still have to figure out the most effective queries themselves in order to retrieve relevant information. In contrast, TR Discover accepts natural language questions, enabling users to express their search requests in a more intuitive fashion. By understanding and translating a natural language question into a structured query, our system then retrieves the exact answer to the question.

NLIs have been applied to various domains [16], [45]. Much of the prior work parses a natural language question with various NLP techniques, utilizes the identified entities, concepts and relationships to build a SPARQL or a SQL query, and retrieves answers from the corresponding data stores, e.g., a triple store [12], [46] or a relational database [45]. In addition to adopting fully automatic question understanding, CrowdQ also utilizes crowdsourcing techniques for understanding natural language questions [47]. Instead of only using structured data, HAWK [48] utilizes both structured and unstructured data for question answering.

Compared to the state of the art, we maintain flexibility by first parsing a question into First Order Logic, which is further translated into SPARQL. Using FOL allows us to be agnostic to which query language will be used later. We do not incorporate any query language statements directly into the grammar, keeping our grammar leaner and more flexible for adapting to other query languages. Another distinct feature of our system is that it helps users to build a complete question by providing suggestions according to a partial question and a grammar. Although ORAKEL [49] also maps a natural language question to a logical representation, no auto-suggest is provided to the users.

Knowledge Graph in Practice. The Google Knowledge Graph had about 570 million entities as of 2014 [1] and has been adopted to power Google’s online search. Yahoo and Bing¹¹ are also building their own knowledge graphs to facilitate search. Facebook’s Open Graph Protocol¹² allows

11. http://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing/

12. http://ogp.me/

users to embed rich metadata into webpages, which essentially turns the entire web into a big graph of objects rather than documents. In terms of data, the New York Times has published data in RDF format¹³ (5,000 people, 1,500 organizations and 2,000 locations). The British Broadcasting Corporation has also published in RDF, covering a much more diverse collection of entities¹⁴, e.g., persons, places, events, etc. Thomson Reuters now also provides free access to part of its knowledge graph¹⁵ (3.5 million companies, 1.2 million equity quotes and others).

8 DISCUSSION

8.1 Challenges and Lessons Learned

Towards Generic Data Transformation and Integration. State-of-the-art NER and relation extraction techniques have mainly focused on common entity types [29], such as locations, people and organizations; however, our data covers a much more diverse set of entity types, including drugs, medical devices, regulations, legal topics, etc., thus requiring a more generic capability. While developing our NLP components, we performed internal evaluations against other systems and discovered that our internally developed modules enable us to better meet our product needs (e.g., high precision for NER). Furthermore, the capability to integrate such mined information from unstructured data with existing structured data, and to ultimately generate insights for our users based upon such integrated data, is key to the success of our business. In this paper, we have presented a number of publicly available services for information extraction and integration. In future work, we plan to improve their domain coverage and performance.

Although these techniques are used to build and query the graph in the first place, these services can also benefit from information in the knowledge graph. First of all, our knowledge graph is used to create gazetteers and entity fingerprints, which help to improve the performance of our NER engine. For example, company information, such as industry, geographical location and products, from the knowledge graph is used to create a company fingerprint. For entity linking, when a new entity is recognized from a free text document, the information from the knowledge

13. http://data.nytimes.com/
14. http://www.bbc.co.uk/things/
15. https://permid.org/


graph is used to identify candidate nodes that this new entity might be linked to. Finally, our natural language interface relies on a grammar for question parsing, which is built based upon information from the knowledge graph, such as the entity types (e.g., company and person) and their relationships (e.g., “works for”).
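A company fingerprint of the kind described above might be sketched as collecting an entity's attribute values from the graph as a bag of context terms for NER and linking; the triples here are invented examples, not real graph content:

```python
# Invented example triples standing in for knowledge-graph facts about a
# company (industry, headquarters, product).
triples = [
    ("AcmeCorp", "industry", "Pharmaceuticals"),
    ("AcmeCorp", "headquartered_in", "London"),
    ("AcmeCorp", "product", "AcmePill"),
]

def fingerprint(entity, triples):
    """Collect an entity's attribute values as a set of context terms."""
    return {obj for subj, _, obj in triples if subj == entity}

print(fingerprint("AcmeCorp", triples))
```

Such a term set can then serve as disambiguating context when the same surface name appears in free text.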

Data Modeling. Our content covers diverse domains that range from finance to intellectual property and science to legal and tax. It would be difficult for our engineers to precisely model such a complex space of domains and convert the ingested and integrated data into RDF triples. When we initially attempted this data modeling approach, it became clear that it was severely constrained by our engineering staff's relative lack of expertise in the content. This has pushed us towards investing in editorial-focused self-service tooling that separates the software and content expertise. Rather than having engineers understand and perform the modeling, we collaborate closely with our editorial colleagues in order to model the data, apply the model to new content, and embed the semantics into our data alongside its generation.

Distributed and Efficient RDF Data Processing. The relative scarcity of distributed tools for storing and querying RDF triples is another challenge. This reflects the inherent complexities of dealing with graph-based data at scale. Storing all triples in a single node would allow efficient graph operations, but this approach may not scale well when we have an extremely large number of triples. Although we have been studying existing approaches for distributed RDF data processing and querying, these approaches often require a large and expensive infrastructure [50]. Our current solution is to use a highly scalable data warehouse (e.g., Apache Cassandra¹⁶ and Elasticsearch) for storing the RDF triples; meanwhile, slices of this graph can be retrieved from the entire graph, put in specialized stores, and optimized to meet particular user needs.

Converging Triples from Multiple Sources. Another challenge is the lack of inherent capability within RDF for update and delete operations, particularly when multiple sources converge predicates under a single subject. In this scenario, one cannot simply delete all predicates and apply the new ones: triples from another source would be lost. While a simplistic solution might be to delete by predicate, this approach does not account for the same predicate coming from multiple sources. For example, if two sources state a “director-of” predicate for a given subject, an update from one source cannot delete the triple from the other source. Our solution is to use quads, with the fourth element as a named graph, allowing us to track the source of each triple and act upon subsets of the predicates under a subject.
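A minimal sketch of this quad-based update, with the fourth element naming the source graph (the subjects, predicates and source names here are invented):

```python
# Each quad is (subject, predicate, object, source-graph). Two sources
# both assert a "director-of" predicate on the same subject.
quads = {
    ("companyX", "director-of", "Alice", "source:A"),
    ("companyX", "director-of", "Bob", "source:B"),
}

def update_source(quads, subject, source, new_triples):
    """Replace all of `source`'s statements about `subject`,
    leaving other sources' statements intact."""
    kept = {q for q in quads if not (q[0] == subject and q[3] == source)}
    kept |= {(s, p, o, source) for (s, p, o) in new_triples}
    return kept

# An update from source:A removes only source:A's triples;
# source:B's "director-of" statement survives.
quads = update_source(quads, "companyX", "source:A",
                      [("companyX", "director-of", "Carol")])
```

Deleting by subject or by predicate alone would have wiped out source:B's statement; scoping the delete to the named graph avoids this.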

Natural Language Interface. The first challenge is the tension between the desire to keep the grammar lean and the need for broad coverage. Our current grammar is highly lexicalized, i.e., all entities (lawyers, drugs, persons, etc.) are maintained as entries in the grammar. As the size of the grammar expands, the complexity of troubleshooting issues that arise increases as well. For example, a grammar with 1.2 million entries takes about 12 minutes to load on our server, meaning that troubleshooting even minor issues on

16. http://cassandra.apache.org/

the full grammar can take several hours. As a solution, we are currently exploring options to delexicalize portions of the grammar, namely collapsing entities of the same type, thus dramatically reducing the size of the grammar.
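One possible shape of such delexicalization, sketched with invented entries: entity names collapse into per-type tokens, with a lookup table resolving names to types at parse time, so the grammar needs one rule per type rather than one per entity:

```python
# Invented entity table standing in for the 1.2M-entry lexicon.
entity_table = {
    "ibuprofen": "DRUG",
    "aspirin": "DRUG",
    "J. Smith": "LAWYER",
}

# Lexicalized grammar: one rule per entity name.
lexicalized_rules = [f"NP -> '{name}'" for name in entity_table]

# Delexicalized grammar: one rule per entity *type*.
delexicalized_rules = [f"NP -> '{t}'" for t in sorted(set(entity_table.values()))]

def tokenize(word):
    """Replace a known entity name with its type token before parsing."""
    return entity_table.get(word, word)

print(len(lexicalized_rules), "->", len(delexicalized_rules), "rules")
```

The grammar then shrinks from the number of entities to the number of entity types, while the entity table remains a fast dictionary lookup outside the parser.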

The second issue is increasing the coverage of the grammar without the benefit of in-domain query logs, both in terms of paraphrases (synonymous words and phrases that map back to the same entity type and semantics) and syntactic coverage for the various constructions that can be used to pose the same question. To get around this issue, we use crowdsourced question paraphrases to expand the coverage of both the lexical and syntactic variants. For example, although we cover questions like which companies are developing cancer drugs, users also supplied paraphrases like which companies are working on cancer medications, thus allowing us to add entries such as working on as a synonym for develop and medication as a synonym for drug.
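This paraphrase expansion can be sketched as a synonym table that maps crowdsourced variants back to canonical lexicon entries (the entries below are the paper's own examples; the `canonicalize` helper is illustrative):

```python
# Crowdsourced paraphrases mapped back to canonical lexicon entries.
synonyms = {"working on": "develop", "medication": "drug"}

def canonicalize(question):
    """Rewrite surface variants into canonical lexicon tokens before parsing."""
    for variant, canonical in synonyms.items():
        question = question.replace(variant, canonical)
    return question

result = canonicalize("which companies are working on cancer medications")
print(result)
```

In the deployed system such synonyms become additional grammar entries rather than string rewrites, but the effect is the same: both phrasings parse to the same semantics.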

Finally, our ultimate aim is to design a flexible NLI that is portable from domain to domain. The initial creation of the grammar took about 2 months of research, design and experimentation, and is used for the deployment on Thomson Reuters Cortellis¹⁷ (Cortellis). We further adapted the grammar for experiments in the biomedical domain and found that the adaptation took 30 person hours. Another adaptation, to support a Thomson Reuters Legal dataset, took about 2 weeks due to the size and complexity of the data. We are continuing to explore ways to reduce the time and expertise needed for domain adaptation.

8.2 Our Vision for the Graph

In building out a graph of this scale, we have found it important to have some guiding principles.

Broad Domain Coverage. Both the logical and physical manifestations of the graph are assets for the whole company. Therefore, it is critical that the graph represent content from every part of the business. In addition, external and in particular open data should be included to further augment the graph.

All of the Data - But not Every Facet. In our view, the enterprise-wide value of any given data item diminishes with greater detail. For example, the fundamentals of a company are useful to many of Thomson Reuters’ products regardless of business unit; however, detailed bond yields for a specific company are only of interest to our F&R customers. The implication is that the graph is far more useful as a broad asset spanning many content types than as a deep asset focused on all the details of a particular data type. We believe that such details need to be expressed at the right level. For instance, for a free text document, rather than putting the actual contents into the graph, we store the location of the document in the graph for users’ easy access.

Technology Independent and Product Agnostic. We use the RDF suite of standards to describe and manipulate data. Doing so insulates us from technology lock-in and allows us to grow with the market. Furthermore, the graph has to be agnostic of any product. Our knowledge graph represents a global view of our data without being designed for any specific product. Individual products can then pull relevant data from it to satisfy their specific needs while enjoying the

17. http://lifesciences.thomsonreuters.com/products/cortellis


rich information that connections in the graph may provide. This way, our graph serves as a central source of data with linkages among the pieces, and can be used for enhancing existing products and developing new ones.

8.3 Deployment Status

At the time of writing, our knowledge graph has been in production for one year and is replicated for high availability across two data centers. The underlying big-data-based content pipelines (i.e., data acquisition, transformation and integration) continually update the graph with new facts, and throttling is used to preserve the balance between ingest and online availability. The domain coverage of the contents within the graph continues to expand as we build out additional products dependent on the graph.

The natural language interface, i.e., TR Discover, has been deployed to different slices of our entire knowledge graph: Cortellis and Legal. Cortellis is a data integration and search platform for professional users in the pharmaceutical industry, covering data in Life Sciences, Intellectual Property, Legal and Finance. Our Legal sub-graph consists of legal cases, judges, attorneys, etc. Currently, users can search for information (such as drug repurposing, a company’s legal cases, or the patents of a particular company) with keyword queries in both domains. With TR Discover, customers can retrieve such information in a more intuitive manner, compared to the current keyword-based query mechanism.

9 CONCLUSION

In this paper, we present our effort in building and querying Thomson Reuters’ knowledge graph. Data in heterogeneous formats is first acquired from various sources. We then develop named entity recognition, relation extraction and entity linking techniques for mining information from the data and integrating the mined data across different sources. We model and store our data as RDF triples, and present TR Discover, which enables users to search for information with natural language questions. We evaluate and demonstrate the practicability of our knowledge graph. In future work, we would like to enhance our NLP algorithms in order to cover more domains. Also, rather than relying on a pre-defined grammar for understanding natural language questions, we will explore the possibility of developing a more flexible question parser. Finally, we will deploy our knowledge graph to more products and improve our various services according to customer feedback.

TRADEMARK ACKNOWLEDGMENT

One or more names mentioned in this paper are trademarks or registered trademarks of their respective owners.

REFERENCES

[1] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: a web-scale approach to probabilistic knowledge fusion,” in The 20th ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining (KDD), 2014, pp. 601–610.

[2] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[3] L. Ratinov and D. Roth, “Design challenges and misconceptions in named entity recognition,” in Proceedings of the Thirteenth Conference on Computational Natural Language Learning, 2009, pp. 147–155.

[4] G. Zhou, J. Su, J. Zhang, and M. Zhang, “Exploring various knowledge in relation extraction,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.

[5] P. Christen, “A survey of indexing techniques for scalable record linkage and deduplication,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 9, pp. 1537–1555, 2012.

[6] S. Veeramachaneni and R. K. Kondadadi, “Surrogate learning: From feature independence to semi-supervised classification,” in Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, 2009, pp. 10–18.

[7] C. Dozier, H. Molina-Salgado, M. Thomas, and S. Veeramachaneni, “Concord - a tool that automates the construction of record resolution systems,” in Proceedings of the Entity Workshop of LREC, 2010.

[8] A. Harth and S. Decker, “Optimized index structures for querying RDF from the web,” in Third Latin American Web Congress, 2005, pp. 71–80.

[9] J. J. Carroll, C. Bizer, P. J. Hayes, and P. Stickler, “Named graphs, provenance and trust,” in Proceedings of the 14th Int’l Conference on World Wide Web (WWW), 2005, pp. 613–622.

[10] L. Matteis, A. Hogan, and R. Navigli, “Keyword-based navigation and search over the linked data web,” in Proceedings of the Workshop on Linked Data on the Web (LDOW), 2015.

[11] R. C. Cornea and N. B. Weininger, “Providing autocomplete suggestions,” Feb. 4 2014, US Patent 8,645,825.

[12] C. Unger, L. Buhmann, J. Lehmann, A. N. Ngomo, D. Gerber, and P. Cimiano, “Template-based question answering over RDF data,” in 21st World Wide Web Conference, 2012, pp. 639–648.

[13] J. Bovet and T. Parr, “ANTLRWorks: an ANTLR grammar development environment,” Software: Practice and Experience, vol. 38, no. 12, pp. 1305–1332, 2008.

[14] B. Bishop, A. Kiryakov, D. Ognyanoff, I. Peikov, Z. Tashev, and R. Velkov, “OWLIM: A family of scalable semantic repositories,” Semantic Web, vol. 2, no. 1, pp. 33–42, 2011.

[15] M. Miwa and M. Bansal, “End-to-end relation extraction using LSTMs on sequences and tree structures,” CoRR, vol. abs/1601.00770, 2016.

[16] D. Song, F. Schilder, C. Smiley, C. Brew, T. Zielund, H. Bretz, R. Martin, C. Dale, J. Duprey, T. Miller, and J. Harrison, “TR Discover: A natural language interface for querying and analyzing interlinked datasets,” in 14th Int’l Semantic Web Conference, 2015, pp. 21–37.

[17] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell, “Toward an architecture for never-ending language learning,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[18] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and Mausam, “Open information extraction: The second generation,” in 22nd Int’l Joint Conference on Artificial Intelligence, 2011, pp. 3–10.

[19] J. Pujara, H. Miao, L. Getoor, and W. W. Cohen, “Knowledge graph identification,” in Int’l Semantic Web Conference, 2013, pp. 542–557.

[20] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia,” Semantic Web, vol. 6, no. 2, pp. 167–195, 2015.

[21] D. Vrandecic and M. Krotzsch, “Wikidata: a free collaborative knowledgebase,” Commun. ACM, vol. 57, no. 10, pp. 78–85, 2014.

[22] R. Navigli and S. P. Ponzetto, “BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network,” Artificial Intelligence, vol. 193, pp. 217–250, 2012.

[23] C. Wang, M. Gao, X. He, and R. Zhang, “Challenges in Chinese knowledge graph construction,” in 31st IEEE International Conference on Data Engineering Workshops, 2015, pp. 59–61.

[24] M. A. Hearst, “Automatic acquisition of hyponyms from large text corpora,” in 14th Int’l Conference on Computational Linguistics, 1992, pp. 539–545.

[25] R. Feldman, Y. Aumann, Y. Liberzon, K. Ankori, Y. Schler, and B. Rosenfeld, “A domain independent environment for creating information extraction modules,” in 2001 ACM Int’l Conference on Information and Knowledge Management, 2001, pp. 586–588.

[26] G. Zhou and J. Su, “Named entity recognition using an HMM-based chunk tagger,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 473–480.


[27] A. McCallum and W. Li, “Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons,” in Proceedings of the Seventh Conference on Natural Language Learning, 2003, pp. 188–191.

[28] H. L. Chieu and H. T. Ng, “Named entity recognition: A maximum entropy approach using global information,” in 19th Int’l Conference on Computational Linguistics (COLING), 2002.

[29] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.

[30] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.

[31] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” ArXiv e-prints, 2016.

[32] S. Miller, H. Fox, L. A. Ramshaw, and R. M. Weischedel, “A novel use of statistical parsing to extract information from text,” in 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000, pp. 226–233.

[33] D. Zelenko, C. Aone, and A. Richardella, “Kernel methods for relation extraction,” Journal of Machine Learning Research, vol. 3, pp. 1083–1106, 2003.

[34] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng, “Semantic compositionality through recursive matrix-vector spaces,” in EMNLP-CoNLL, 2012, pp. 1201–1211.

[35] C. N. dos Santos, B. Xiang, and B. Zhou, “Classifying relations by ranking with convolutional neural networks,” in 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int’l Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp. 626–634.

[36] Q. Li and H. Ji, “Incremental joint extraction of entity mentions and relations,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 402–412.

[37] W. E. Winkler, “The state of record linkage and current research problems,” Statistical Research Division, US Census Bureau, 1999.

[38] W. Shen, J. Wang, and J. Han, “Entity linking with a knowledge base: Issues, techniques, and solutions,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 2, pp. 443–460, 2015.

[39] P. Ferragina and U. Scaiella, “Fast and accurate annotation of short texts with Wikipedia pages,” IEEE Software, vol. 29, no. 1, pp. 70–75, 2012.

[40] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran, “Evaluating entity linking with Wikipedia,” Artif. Intell., vol. 194, pp. 130–150, 2013.

[41] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer, “DBpedia Spotlight: shedding light on the web of documents,” in 7th Int’l Conference on Semantic Systems, 2011, pp. 1–8.

[42] A. Sil, E. Cronin, P. Nie, Y. Yang, A. Popescu, and A. Yates, “Linking named entities to any database,” in EMNLP-CoNLL, 2012, pp. 116–127.

[43] M. d’Aquin and E. Motta, “Watson, more than a semantic web search engine,” Semantic Web Journal, vol. 2, no. 1, pp. 55–63, 2011.

[44] C. Bobed and E. Mena, “QueryGen: Semantic interpretation of keyword queries over heterogeneous information systems,” Inf. Sci., vol. 329, pp. 412–433, 2016.

[45] F. Li and H. V. Jagadish, “Constructing an interactive natural language interface for relational databases,” PVLDB, vol. 8, no. 1, pp. 73–84, 2014.

[46] M. Yahya, K. Berberich, S. Elbassuoni, and G. Weikum, “Robust question answering over the web of linked data,” in Int’l Conference on Information and Knowledge Management, 2013, pp. 1107–1116.

[47] G. Demartini, B. Trushkowsky, T. Kraska, and M. J. Franklin, “CrowdQ: Crowdsourced query understanding,” in CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, 2013.

[48] R. Usbeck, A.-C. Ngonga Ngomo, L. Buhmann, and C. Unger, “HAWK - hybrid question answering over linked data,” in 12th Extended Semantic Web Conference, 2015.

[49] P. Cimiano, P. Haase, J. Heizmann, M. Mantel, and R. Studer, “Towards portable natural language interfaces to knowledge bases - the case of the ORAKEL system,” Data Knowl. Eng., vol. 65, no. 2, pp. 325–354, 2008.

[50] “Oracle Spatial and Graph: Benchmarking a trillion edges RDF graph,” http://download.oracle.com/otndocs/tech/semanticweb/pdf/OracleSpatialGraph_RDFgraph_1_trillion_Benchmark.pdf, 2014.

Dan Bennett is responsible for building Big, Open and Linked Data capabilities for re-use across Thomson Reuters.

Dezhao Song is a Research Scientist at Thomson Reuters, focusing on Entity Coreference and Question Answering in the Semantic Web.

Frank Schilder is a Research Director at Thomson Reuters. His research focus is Machine Learning and Computational Linguistics.

Shai Hertz leads the algorithms team of Thomson Reuters’ Text Metadata Services, working on Natural Language Processing.

Giuseppe Saltini is an Information Architect at Thomson Reuters, defining Linked Data strategy and architecture for the company.

Charese Smiley is a Research Scientist at Thomson Reuters. Her research interests are Question Answering and Computational Linguistics.

Phani Nivarthi is a Senior Research Engineer at Thomson Reuters. His research focus is Natural Language Processing.

Oren Hazai is a Research Engineer at Thomson Reuters. His primary research focus is Natural Language Processing.

Dudi Landau leads the technology aspects of Thomson Reuters Text Metadata Services, which includes developing the knowledge graph.

Mike Zaharkin is a Principal Architect at Thomson Reuters. His primary research focus is Big Data Solutions in the Hadoop Ecosystem.

Tom Zielund is a Senior Research Scientist at Thomson Reuters, working on mining, modeling, and management of structured data.

Hugo Molina-Salgado is a Senior Research Scientist at Thomson Reuters, working on Natural Language Processing and Summarization.

Chris Brew is a Senior Research Scientist at Thomson Reuters. His primary research focus is Natural Language Processing.