
The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain

Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, and Andries Smith
CareerBuilder, Norcross, GA, USA

Email: {trey.grainger, khalifeh.aljadda, mohammed.korayem, andries.smith}@careerbuilder.com

Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain. The source code for our Semantic Knowledge Graph implementation is being published along with this paper to facilitate further research and extensions of this work.1

I. INTRODUCTION

Graphs are a well-studied class of data structures used to model relationships (edges) between entities (nodes). Knowledge bases in general, and ontologies specifically, model a domain by defining how different entities within the domain are related. Such knowledge bases are most commonly represented as a graph, and both the nodes and the edge relationships between nodes in that graph must be explicitly modeled either manually by a domain expert or automatically leveraging an ontology learning system. Because building such a knowledge base typically requires explicitly modeling nodes and edges into a graph ahead of time, this unfortunately presents several limitations to the use of such a knowledge graph:

1 http://github.com/careerbuilder/semantic-knowledge-graph/tree/dsaa2016

• Entities not modeled explicitly as nodes have no known relationships to any other entities.

• Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore such a graph is not ideal for representing nuanced meanings of an entity when appearing within different contexts, as is common within natural language.

• Substantial meaning is encoded in the linguistic representation of the domain that is lost when the underlying textual representation is not preserved: phrases, interaction of concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing those entities, variations in spelling and other representations of entities, the use of adjectives to modify entities to represent more complex concepts, and aggregate frequencies of occurrence for different representations of entities relative to other representations.

• It can be an arduous process to create robust ontologies, map a domain into a graph representing those ontologies, and ensure the generated graph is compact, accurate, comprehensive, and kept up to date.

We propose a new system for modeling relationships between entities that overcomes these limitations. This system, which we refer to as a Semantic Knowledge Graph, is aimed at extracting and representing the knowledge of a domain automatically from a corpus of documents representative of that domain. The underlying representation ultimately encodes the semantic relationships between words, phrases, and extracted concepts in such a way that those relationships can later surface to expose new insights about the interrelationships between all entities within the domain.

This kind of system has numerous applications which we will explore. It can be used to automatically discover sets of related terms within a domain, to represent and disambiguate multiple meanings of the same phrases, to power semantic search by dynamically expanding user queries to conceptually-related keywords/phrases, to identify trending topics across time-series data, to build a content-based recommendation engine, to perform data cleansing on lists by scoring how relevant each item is to the list, to perform document summarization by detecting the importance of each phrase and entity within a document, and to do predictive analytics on time-series data.

In its most basic use case (a corpus of free-text documents), a Semantic Knowledge Graph can be leveraged to automatically discover domain-specific relationships between entities within a domain. Given a corpus of documents also containing some amount of structured information (specific fields for titles, categories, dates, or other specific kinds of entities), it will treat each of those field types as a new edge that can be traversed between any two nodes co-occurring within the same documents with some (specifiable) minimum frequency.

One of the novelties of the system is that a layer of indirection exists between each node and the edge connecting it to any other node. Instead of explicitly defining an edge connecting two nodes with a predetermined relationship, as most graph databases are designed, a Semantic Knowledge Graph instead materializes edges during traversal between any two nodes based upon the intersection of the document sets to which both of the nodes link. Furthermore, because the edges between nodes are dynamically materialized based upon the set of shared documents to which they both link, this means that it is also possible to dynamically materialize new nodes by combining existing nodes (through their underlying sets of documents) in any arbitrarily-complex way. This subsequently means that any arbitrarily-complex nodes (for example, any linguistic combination of character sequences, terms, and term sequences) can also be decomposed into their minimum constituent parts (terms related by position within documents) when building the graph, enabling a highly-compressed graph representation which is capable of reconstituting and traversing every existing relationship within a knowledge domain.

As a result of maintaining all of the corpus occurrence statistics about each node, a Semantic Knowledge Graph can also dynamically discover and score interesting relationships between any nodes based upon the statistical similarity of the nodes in any given context. The Semantic Knowledge Graph thus represents a novel graph model which is both auto-generated and yet able to represent, traverse, and score every relationship represented within a corpus of documents representing a knowledge domain.

II. RELATED WORK

Ontologies can be defined as explicit formal specifications of the terms within a domain and the relations among them [1]. Ontologies have become common across various domains for building vocabulary to be shared and used by domain experts. Many advantages can be gained by building a common vocabulary, including improving the re-usability of domain knowledge, enabling a common understanding of the structure of information, and providing the ability to analyze domain knowledge. Ontologies can be classified into three different categories [2]: formal ontologies that have axioms and definitions in logic, terminological ontologies (e.g., WordNet [3]), and prototype-based ontologies having typical instances or prototypes instead of axioms. Recently, large-scale knowledge bases that utilize ontologies (FreeBase [4], DBpedia [5], and YAGO [6, 7]) have been constructed using structured sources such as Wikipedia infoboxes. Other approaches (DeepDive [8], Nell2RDF [9], and PROSPERA [10]) crawl the web and use machine learning and natural language processing to build web-scale knowledge graphs.

Existing work on ontologies and knowledge bases still suffers from significant limitations. Manually-created knowledge bases are expensive and labor-intensive to build and maintain, and are thus generally incomplete and have a tendency to grow out of date over time. While ontology learning systems are typically able to automate much of the ontology building (and sometimes maintenance) process, this comes at the expense of a loss of accuracy due to the replacement of human experts with more error-prone algorithms [11–15].

Current ontology learning systems also throw away a substantial amount of information encoded within the textual content they are processing. For example, any entities not discovered as nodes during the ontology mining process have no known relationships to any other entities, regardless of whether those relationships were actually represented within the analyzed content (and just overlooked) during the ontology mining process. Furthermore, since most terms and phrases can take on alternate, nuanced meanings within different contexts, these nuanced meanings are often lost when representing terms and phrases as single nodes independent of the context in which they are used. Finally, a substantial amount of meaning is encoded in the nuanced linguistic representations present in a corpus of free-text content (terms, character sequences, term ordering, placement of words within phrases and paragraphs, and so on). Existing ontology creation approaches fail to adequately support the representation and scoring of these nuanced and complex interrelationships.

Our work improves upon current ontology mining approaches by creating a knowledge graph which can fully represent the nuanced relationships between every entity (term, phrase, or other textual representation) represented within a corpus of free-text documents, as well as traverse and score the strength of those relationships or of any combination of those relationships.

III. METHODOLOGY

A. Problem Description

Technology platforms are becoming increasingly capable every day of interpreting and responding to domain-specific and personalized questions. Search engines and recommendation engines, in particular, can barely compete unless they leverage models containing deep insights into the kinds of questions being asked and - more importantly - the kinds of answers being sought. One of the most common ways of representing a domain in order to surface these insights is through the use of ontologies - combinations of taxonomies containing known entities, their properties, and their interrelationships. These ontologies can then be integrated into a search application in order to improve its ability to meet the end-user's information need. For example, if someone searches for the term server in the information technology domain, it has a very different meaning (a computer server) than in the restaurant domain (a waiter/waitress), and if someone is using a job search engine, it could actually represent either meaning depending upon the user's context. Ontologies can help represent the relationships between entities such that they can be used to improve the accuracy of the system at meeting its users' information needs.

Ontologies are usually built manually by human experts, making them expensive to build, maintain, and update. To combat this, ontology learning systems, which attempt to automatically learn relationships from a domain and then map them into an ontology, are becoming more prevalent [16].


Fig. 1. Semantic relationships encoded in free text content. Terms are composed of one or more character sequences, term sequences are composed of one or more terms, and documents are composed of one or more fields containing zero or more term sequences.

We would like to create a system that is able to automatically generate a graph representation of a knowledge domain simply from ingesting a corpus of data representing the domain, while simultaneously preserving all of the linguistic and statistical relationships between the keywords, phrases, and extracted entities from the corpus. Once a model (a graph) is built from this data, we can then leverage it to better understand the interrelationships between those words, phrases, and entities.

Natural language, as represented in full-text documents, contains tremendous meaning compressed within its linguistic structures, represented through multiple levels of abstraction:

• Corpus: a list of documents representative of a knowledge domain

• Document: a list of fields relating to each other through some underlying entity

• Field: a grouping of zero or more term sequences representative of a relationship with a document

• Term Sequence: an ordered representation of one or more terms

• Term: a character sequence representing a known meaning (for example, a recognizable word)

• Character Sequence: an ordered combination of one or more characters

• Character: a letter or symbol used within natural language (represents no meaning by itself)

While a corpus, document, and field are common concepts within the field of information retrieval, concrete examples of term sequences, terms, character sequences, and characters are presented in Figure 1 for further explication. In this figure, you can see that the term sequence software engineering is composed of the ordered sequence of the two terms software and engineer, and that the word engineering contains the character sequence engineer, which contains the character sequences e, en, eng, ..., engineer, etc.

Our goal is to automatically generate a knowledge graph from an underlying corpus of documents. In order to avoid the previously mentioned pitfalls with manually generated ontologies and ontology learning systems, we need a way to fully preserve these nuanced semantic interrelationships embedded within a corpus of textual documents.

Our overarching goal is not simply to link entities with known relationships, however, but to actually present the ability to discover any arbitrarily-complex relationship between entities within the domain. Consider a typical search engine, where a user can query any keywords, phrases, or arbitrarily-complex combinations of character sequences, terms, or term sequences. We would like to be able to traverse our automatically-generated knowledge graph and instantly understand the nuanced meaning(s) represented by these arbitrarily-complex natural language queries.

To understand the significance of this goal, let's consider the way in which the meaning of terms is modified given their context. The term engineer has a well-known abstract meaning, but when found inside the phrase software engineer, it takes on a much more limited interpretation. Similarly, the word driver takes on two entirely different meanings when found near terms relating to computers (a hardware driver) versus in contexts related to transporting goods (a truck driver or delivery driver). While we tend to think of most terms and phrases as having a limited number of meanings, it is far more accurate to think of them as having a slightly different meaning in every unique context in which they are found. While terms and phrases usually share strong similarities in their intended meanings across contexts, by allowing both those strong similarities and the nuanced differences to surface during node traversals, we are able to discover the most important interrelationships between entities in any given context and thus much better represent the intended knowledge domain. Our Semantic Knowledge Graph model provides a compact representation of an entire knowledge domain (as represented within a corpus of documents) which accomplishes these goals.

B. Model Structure

Consider an undirected graph G = (V, E), where V and E ⊂ V × V denote the sets of nodes and edges, respectively. We define the following:

• D = {d1, d2, ..., dm} is a set of documents that represent a corpus that the Semantic Knowledge Graph will utilize to extract and score semantic relationships.

• X = {x1, x2, ..., xk} is a set of all items stored in D. These items could be keywords, phrases, or any arbitrary linguistic representation found within D.

• di = {x | x ∈ X}, where we can think of each document d ∈ D as a set of items.

• T = {t1, t2, ..., tn}, where ti is a tag which assigns an entity type to an item, such as keyword, title, location, company, school, person, etc.

Given the previous notations, the set of nodes V in our graph can be defined as V = {v1, v2, ..., vn}, where vi stores an item xi ∈ X tagged with tag tj ∈ T, while Dvi = {d | xi ∈ d, d ∈ D} is the set of documents that contain item xi with its appropriate tag tj. Finally, we define eij as an edge between (vi, vj) with a function f(eij) = {d ∈ Dvi ∩ Dvj} that stores on each edge the set of documents that contain both items xi and xj with their tags. We additionally define g(eij, vk) = {d : d ∈ f(eij) ∩ Dvk}, which stores on the edge ejk the common set of documents shared between f(eij) and Dvk.
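As a minimal illustration of these definitions (a sketch with hypothetical toy data, not the open-sourced implementation), each document set Dv can be modeled in Python as a set of document ids, with f and g implemented as the set intersections defined above:

docs_for = {                                   # Dv: (item, tag) -> set of doc ids
    ("java", "keyword"): {1, 2, 3, 5},
    ("hadoop", "keyword"): {2, 3, 7},
    ("data scientist", "title"): {2, 5, 7},
}

def f(vi, vj):
    # f(eij): documents shared by vi and vj; the edge materializes iff non-empty
    return docs_for[vi] & docs_for[vj]

def g(vi, vj, vk):
    # g(eij, vk): documents on the edge eij that also contain item xk
    return f(vi, vj) & docs_for[vk]

print(f(("java", "keyword"), ("hadoop", "keyword")))        # {2, 3}
print(g(("java", "keyword"), ("hadoop", "keyword"),
        ("data scientist", "title")))                       # {2}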

C. Materialization of Nodes and Edges

Core to the SKG model is the idea that a layer of indirection exists between any two nodes vi and vj and the edge eij that connects them. Specifically, instead of nodes being directly connected to each other through explicit edges, nodes are instead connected bidirectionally to documents, such that the edge eij between nodes vi and vj is said to materialize whenever |f(eij)| > 0.

Thus, in order to traverse the graph from source node vi to destination node vj, our system requires a lookup index linking node vi to a set of documents, as well as a separate lookup index which can map from those documents to node vj or other nodes to which a traversal may need to occur. We refer to the first index as our terms-docs inverted index, and to the second as our docs-terms uninverted index, both shown in Figure 2 (a). These two indexes enable us to model all terms as nodes within the graph and to materialize and traverse from any node to any other node through the sets of shared documents between the nodes, as shown in Figure 2 (b).

Because edges materialize during graph traversal based upon an intersection of documents to which both nodes are connected, this means that we can form an edge between any entity that is representable by an underlying set of documents to which it is linked. Thus, instead of being restricted to only using predefined entities from our terms-docs index, it is also possible to dynamically materialize new nodes on the fly based upon any combination of terms, as shown in Figure 2 (c).

Since complex representations of entities can be materialized as nodes from arbitrary combinations of existing terms, this enables us to also decompose complex entities into individual terms (with positional relationships) for persistence in the underlying terms-docs inverted index. Through this process of decomposing our corpus into individual terms, the documents in which the terms appear, and the positions in those documents where the terms appear, we can thus create a highly compressed and lossless representation of every relationship within our original corpus. Then, at traversal time, we can materialize nodes representing any representation found within the original corpus, as well as edges connecting any materialized or predefined nodes to other nodes.
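As a rough sketch of these two index structures and of node materialization via set operations (a toy stand-in for the Lucene/Solr data structures, with positional information omitted for brevity):

from collections import defaultdict

corpus = {                                  # hypothetical toy corpus
    1: ["java", "hadoop", "spark"],
    2: ["java", "software", "engineer"],
    3: ["registered", "nurse", "hospital"],
}

terms_docs = defaultdict(set)               # terms-docs inverted index
docs_terms = defaultdict(set)               # docs-terms uninverted index
for doc_id, terms in corpus.items():
    for term in terms:
        terms_docs[term].add(doc_id)
        docs_terms[doc_id].add(term)

# Materialize a new node from an arbitrary combination of existing terms:
# documents containing java AND (hadoop OR spark).
materialized = terms_docs["java"] & (terms_docs["hadoop"] | terms_docs["spark"])

# Traverse from the materialized node to neighboring term nodes via shared docs.
neighbors = set().union(*(docs_terms[d] for d in materialized)) - {"java"}
print(materialized, neighbors)              # {1} {'hadoop', 'spark'}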

D. Scoring Semantic Relationships

The Semantic Knowledge Graph (SKG) is able to score and represent the strength of the semantic relationship between entities on the edge connecting them. For example, if we don't know how semantically related the keyword java is to the keyword hadoop, we can utilize the SKG to score the relationship between these two terms. To score a semantic relationship between item xi and item xj using the SKG, we materialize source node vi (holding the documents linked to by xi) and destination node vj (representing the set of documents containing xj).

The simple use case for scoring semantic relationships is to score directly connected nodes vi and vj. In this case we query the terms-docs inverted index for item xi tagged with tj, and as a result we get back Dvi. Then we query the terms-docs inverted index again for xj tagged with tk to get Dvj. An edge eij will be created between vi and vj if f(eij) ≠ ∅. We call Dvi our foreground document set DFG, while DBG ⊆ D is our background document set. The hypothesis behind our scoring technique is that if xi tends to be semantically related to xj, then the presence of xj in the foreground document set DFG should be above the average presence of xj in DBG. We utilize the z-score to evaluate this hypothesis:

z(vi, vj) = (y − n·p) / √(n·p·(1 − p))

where n = |DFG| is the number of documents in our foreground document set, p = |Dvj| / |DBG| is the probability of finding the term xj with tag tk in the background document set, and y = |f(eij)| is the number of documents containing both xi and xj.

In many cases, we will want to traverse multiple levels of depth n > 2 to find and score relationships between more than just two nodes. For example, we may traverse from the entity java to big data to hadoop, such that the weight assigned to the edge between big data and hadoop would be more meaningful if it were also conditioned upon the path taken to arrive at big data through java. Our system accomplishes this across n nodes and a path P = v1, v2, ..., vn, where each node stores an item xi with its tag tj. To apply the same z(vi, vj) between nodes, but conditioning this score upon the entire path P, the only changes are

DFG = f(eij)                                      if n = 3
DFG = ⋂_{i=1..n−3, j=i+1, k=j+1} g(eij, vk)       if n > 3

while y = |DFG ∩ Dvn|. We normalize the z-score using a sigmoid function to bring the scores into the range [−1, 1]. We call the normalized score the relatedness score between nodes, where 1 means completely positively related (likely to always appear together), 0 means no relatedness (just as likely as anything else to appear together), and −1 means completely negatively related (unlikely to appear together).
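A direct implementation of this scoring might look like the following sketch. The exact sigmoid used for normalization is not specified here, so the sketch assumes a shifted logistic function with an arbitrary slope purely for illustration:

import math

def z_score(n, y, candidate_bg_count, bg_size):
    # n = |DFG|, y = |f(eij)| (or |DFG ∩ Dvn| on longer paths),
    # p = |Dvj| / |DBG|, per the formulas above
    p = candidate_bg_count / bg_size
    return (y - n * p) / math.sqrt(n * p * (1 - p))

def relatedness(z, slope=10.0):
    # squash z into [-1, 1]; the slope is an assumed tuning constant
    return 2.0 / (1.0 + math.exp(-z / slope)) - 1.0

# Illustrative numbers: 54 of 86 foreground docs contain the candidate term,
# which appears in 356 of 3,000,000 background docs.
print(relatedness(z_score(n=86, y=54, candidate_bg_count=356, bg_size=3_000_000)))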

While the relatedness score provides a weight on each edge corresponding to the strength of the semantic relationship between two nodes, since this score is calculated at traversal time it is also possible to substitute in other scoring functions depending upon the use case at hand. Popularity (the total count of overlapping documents), for example, is another function that may be appropriate for simpler use cases.

E. Discovering Semantic Relationships

Fig. 2. (a): Two indexes needed to power the Semantic Knowledge Graph. The SKG leverages two indexes per field: a docs-terms uninverted index mapping documents to the terms they contain, and a terms-docs inverted index mapping every term in the corpus to a postings list of documents containing the term and the positions of the term within each document. (b): Materialization of edges using shared documents. Only terms which share documents have an edge, and the weight of the edge is later calculated based upon the statistical distribution of shared documents. (c): Materialization of new nodes through shared documents. New nodes can be dynamically formed (materialized) by specifying any arbitrary combination of character sequences, terms, and term sequences, finding the underlying document set they all match, and leveraging this document set for subsequent traversals.

The SKG is very powerful at surfacing hidden relationships between nodes in real time. Furthermore, this model enables materialization of nodes and extraction of relationships using those materialized nodes. In order to discover items with a specific tag tk related to an item xi with tag tj, we start by querying the inverted index for the item xi, which we assign as node vi corresponding with document set Dvi. We query the docs-terms uninverted index for the tag tk and we store the retrieved documents as Dtk = {d | x ∈ d, x : tk}. We define Vvi,tk = {vj | xj ∈ d, d ∈ Dtk ∩ Dvi}, where vj is a node that stores an item xj, and we define Vvi,tk as the set of nodes that store items with a potential relationship with xi of type tk (see Figure 3 (a)). Finally, we apply relatedness(vi, vj) for all vj ∈ Vvi,tk to score the relationship between vi and vj, which enables us to rank those relationships and pick the top m relationships, or to define a threshold t to accept only relationships with relatedness(vi, vj) > t. This operation of relationship discovery can occur recursively, as shown in Figure 3 (b), to discover and drill into multiple levels of relationships.
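The discovery procedure can be sketched as follows (an illustration rather than the published code: docs_for and items_with_tag stand in for the inverted and uninverted index lookups, and relatedness_score squashes the z-score from the previous subsection into [-1, 1]):

import math

def relatedness_score(fg_docs, candidate_docs, bg_size, slope=10.0):
    # z-score of the candidate's presence in the foreground, squashed to [-1, 1]
    n, y = len(fg_docs), len(fg_docs & candidate_docs)
    p = len(candidate_docs) / bg_size
    z = (y - n * p) / math.sqrt(n * p * (1 - p))
    return 2.0 / (1.0 + math.exp(-z / slope)) - 1.0

def discover_related(xi, tag_k, docs_for, items_with_tag, bg_size,
                     top_m=10, threshold=0.0):
    d_vi = docs_for[xi]
    # V_{vi,tk}: candidate nodes of type tag_k that share documents with xi
    scored = [(xj, relatedness_score(d_vi, docs_for[xj], bg_size))
              for xj in items_with_tag[tag_k] if docs_for[xj] & d_vi]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(xj, s) for xj, s in ranked if s > threshold][:top_m]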

IV. SYSTEM IMPLEMENTATION

Our implementation of the Semantic Knowledge Graph leverages the Apache Lucene/Solr search engine for many of its needed data structures. The data structures leveraged include the underlying inverted index that is used to find pre-existing nodes, the document set intersection logic necessary to materialize new nodes and edges from the terms-docs inverted index, and the docs-terms uninverted index that is necessary to traverse across the edges materialized between nodes.

Since Apache Solr serves as a web server, we leverage it as a framework to expose a RESTful API around our SKG implementation.

In order to build the knowledge graph, one simply needs to send a corpus of documents to the Semantic Knowledge Graph API. These documents will contain one or more fields, typically with at least one field containing raw text, and optionally with one or more additional fields containing more structured information about the document. For the use case of employment search, for example, we could use the command in Table I to add some job postings to the SKG.

curl -H 'Content-type:application/json' http://localhost:8983/solr/semantic-knowledge-graph/update -d '[
  { "id": "job1",
    "title": "Data Scientist",
    "skills": ["machine learning", "spark"],
    "keywords": "Seeking a senior-level data scientist with experience with spark and machine learning..." },
  { "id": "job2",
    "title": "Registered Nurse",
    "skills": ["er", "trauma", "phlebotomy"],
    "keywords": "Come join the top-rated hospital in the region..." }
]'

TABLE I. ADDING DOCUMENTS TO THE SKG

The most important thing to note here is that documents are added to the graph, but no explicit relationships between entities need to be modeled. Instead, the SKG will later allow us to discover relationships between entities - in this case job titles, skills, and keywords - through statistical analysis of how those entities are found together or absent across the entire corpus of documents. Figure 3 (c) visually demonstrates how the underlying data structure and the intersection of sets of documents work together to form a traversable graph model.

Once an entire corpus of documents has been loaded into the SKG, we can now issue queries to the system to traverse and score the relationships between entities. Table II shows an example query and response from the graph.

This request asks the system to find the top job title associated with the phrase data science, and then to find up to three skills, including java, sorted by how similar they are to the job title data science (the previous node in the traversal).

By making our implementation of the SKG model a plugin for Apache Solr, we were able to leverage a pre-built inverted index, an uninverted index, and a rich set of text analysis libraries (tokenizers and token filters) to model documents. This allowed us to focus on the graph semantics, document set intersections, scoring models, and graph traversal API required to implement the SKG without needing to reimplement most of the already well-studied information retrieval structures upon which the SKG relies. Instead of re-implementing a query parser, this also allowed us to make full use of existing tools to map character sequences, terms, and term sequences into their underlying document set representations.

Since an inverted index can be implemented using multiple underlying data structures, this also allows us to easily leverage highly efficient and compressed data structures, such as Lucene's Finite State Automata/Transducers, for more efficient compression and traversal of nodes within the SKG [17].



Fig. 3. (a): Graph representation of node traversal. Models a graph traversal across edges of a specific relationship type (tag). The first traversal is from the starting node of Java to each node for which it holds a has_related_skill edge. The second traversal is to each subsequent node for which a has_related_job_title edge is found. (b): Multilevel graph traversal. This example traverses from a materialized node, through all has_related_skill edges, then from that level of nodes again through each of their has_related_skill edges, and finally from those nodes to each of their has_related_job_title edges. The weights are calculated using the entire traversed path in this example, though it is also possible to calculate weights independent of the path using only each pair of directly connected nodes. (c): Three representations of a traversal. The Data Structure View represents the underlying links from term to document to term in our underlying data structures, the Set Theory View shows the relationships between each term once the underlying links have been resolved, and the Graph View shows the abstract graph representation in the semantics exposed when interacting with the SKG.

Request:

{ "starting_node": ["keywords:\"data science\""],
  "nodes": [{
    "type": "job_title",
    "limit": 1,
    "discover_values": true,
    "nodes": [{
      "type": "skills",
      "limit": 3,
      "discover_values": true,
      "values": ["java"]
    }]
  }]
}

Response:

{ "nodes": [{
    "type": "job_title",
    "values": [{
      "name": "Data Scientist",
      "relatedness": 0.989,
      "popularity": 86.0,
      "fg_popularity": 86.0,
      "background_popularity": 142.0,
      "nodes": [{
        "type": "skills",
        "values": [
          { "name": "Machine Learning", "relatedness": 0.97286, "popularity": 54.0,
            "foreground_popularity": 54.0, "background_popularity": 356.0 },
          { "name": "Predictive Modeling", "relatedness": 0.94565, "popularity": 27.0,
            "foreground_popularity": 27.0, "background_popularity": 384.0 },
          { "name": "Artificial Neural Networks", "relatedness": 0.94416, "popularity": 10.0,
            "foreground_popularity": 10.0, "background_popularity": 57.0 },
          { "name": "Java", "relatedness": 0.76606, "popularity": 37.0,
            "foreground_popularity": 37.0, "background_popularity": 17442.0 }
        ]
      }]
    }]
  }]
}

TABLE II. SAMPLE GRAPH TRAVERSAL REQUEST


V. EXPERIMENTS AND RESULTS

While the SKG is a generally applicable model to any domain representable by documents with overlapping references to the same entities, we focused our testing on use cases within the job search domain, leveraging datasets provided to us by CareerBuilder, one of the largest job boards in the world.

For our experiments, we leveraged two datasets: 1) a collection of 3 million job postings, and 2) a collection of 1 million job seeker resumes containing a total of 3 million employment history sections (representing prior jobs held by a given job seeker). While these two datasets could have been combined into a single graph, we only had the need to use a single dataset at a time and therefore maintained each dataset in a separate SKG for the following experiments. All of our experiments leveraged the SKG implementation described in the System Implementation section, which has been open sourced along with the publication of this paper.

In terms of performance, the SKG was able to easily traverse through and gather millions of nodes in just a few milliseconds (on commodity servers) in our experiments when the relatedness score was not needed. For most of these same queries tested utilizing the relatedness score, the SKG request completed in tens to hundreds of milliseconds, though some very intensive queries traversing multiple levels of nested relationships were observed to take several seconds to complete. Observed times are thus quite fast considering the dynamic nature of the node and edge materialization, making the SKG suitable for integration in real-time search and natural language processing systems.

A. Data Cleansing

Data scientists spend a considerable amount of their time - 60% according to a 2016 survey - cleaning and organizing data sets [18]. Most datasets contain some dirty data, particularly when free text content is involved. While we have previously described how the SKG is able to discover relationships embedded within a corpus of documents, it is just as good at ranking user-supplied relationships.

Use Case: As an example use case, we leveraged the SKG to clean a list of relationships mined from search engine query logs using a similar methodology to that described in [19, 20]. The idea here is that users who conduct similar searches often search for related terms and phrases. For example, someone who searches for registered nurse will often also search for RN, nurse, ER, hospital, and so on. Someone who searches for java will often also search for software engineer, java developer, and so on. After obtaining a list of search terms mapped to their co-occurring terms, we decided to use the SKG to find the weight of the edge between each pair of co-occurring terms.

TABLE III. SAMPLES FOR THE CO-TERMS CLEANED BY SKG

Term                          | Co-term                   | Blacklisted?
system support                | it manager                | Yes
senior buyer                  | customer service manager  | Yes
leasing consultant            | manufacturing manager     | Yes
programmer                    | engineering manager       | Yes
product requirement documents | sows                      | No
events                        | wedding coordinator       | No
electrical engineering        | cad designer              | No

Since the relatedness score is normalized between -1 (perfectly negatively related), 0 (no relationship), and 1 (perfectly positively related), we use 0.5 as our threshold between whether something is likely to have a strong relationship (0.5 to 1.0) or likely to have a weak relationship (0 to 0.5).

Experiment Setup: To set up our experiment, we indexed 3 million job postings into an SKG. We then used this SKG to traverse 2.26 million co-term pairs, traversing across the shared has_related_term edge between the nodes for each term. For each traversal, we analyzed the weight of the edge (the relatedness score), and we added all term pairs with a relatedness score below 0.5 to a blacklist. The end result was the blacklisting of 78% of the co-term pairs (1.77 million blacklisted). Table III shows some examples of co-term pairs which were kept and which were blacklisted.
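In code, this cleansing pass reduces to a single thresholded filter; the sketch below assumes a hypothetical skg_relatedness(term, co_term) helper that performs the edge traversal and returns the relatedness score:

RELATEDNESS_THRESHOLD = 0.5

def build_blacklist(co_term_pairs, skg_relatedness):
    # keep only the pairs the SKG scores as weakly related (below 0.5)
    return [(term, co_term) for term, co_term in co_term_pairs
            if skg_relatedness(term, co_term) < RELATEDNESS_THRESHOLD]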

Results: From the blacklist generated by the SKG, we asked an independent data analyst to randomly select 500 of these blacklisted pairs and tag them as either related or not related. The threshold for something being related, in this case, is whether the data analyst believed a user would wish to see the co-term suggested in a search experience when performing a search for the original term. As a result of this analysis, the data analyst determined that 25 of the 500 blacklisted terms were actually related, while 475 were correctly identified by the SKG as not related. The final results thus showed that the SKG removed 78% of the terms while maintaining 95% accuracy at removing the correct noisy pairs from the input data.

B. Predictive Analytics

The SKG is also effective at performing predictive analytics in order to estimate future behavior based on analysis of past behavior.

Concretely, given sets x1 = {x11, ..., x1m} through xn of feature vectors describing states at times t1 through tn, we would like to predict likely subsequent states at times tn+1, ..., tm.

With a modified scoring function, the SKG materializes nodes which can be interpreted as either the consequent or the antecedent of an association rule, with edge scores corresponding to the confidence of these rules [21].

Career Pathing Use Case: We used the SKG against resume data to characterize and predict employment histories. The tag types indexed include features such as extracted skills and keywords, normalized job titles, job level, location, and duration (in months) of employment. The SKG generates estimates characterizing a hypothetical next job in terms of combinations of the node types indexed. For example, we can generate the maximal confidence consequent of the next job title and the skill most likely to be used at the next job. The SKG implementation also allows support and confidence thresholding through a normalized min_count parameter. Additionally, with the ability to materialize edges and nodes using query parameters, the SKG allows for much more fluid construction of antecedents and consequents.

Fig. 4. Predictive analytics (consequent scoring). Assume a jobseeker has a job title of Logistics Manager, the skill of Distribution (Business), and additionally some experience with the keyword purchasing. The figure shows this starting materialized node with its support on the left. The figure highlights the results for the top five predicted job titles with the middle circles, with the highest confidence job title being Senior Buyer with a confidence of 0.09. The top skills are predicted jointly with the job title in the circles on the right, with Operations as the highest confidence skill, with a confidence of 0.025.

Experiment Setup: We tested the SKG's predictive capabilities using resumes from one million job seekers. We parsed and extracted the tag types of location, job level, job title, skills, and keywords for the most recent three employment history entries of each resume. For each tag set (corresponding to an employment history), we index each tag type appended with an index indicating recency. We additionally use consequent scoring, allowing edge scores to be interpreted as the confidence of association rules, with the new scoring function

c(vi, vj) = |DFG ∩ Dvj| / |DFG|.

Prediction proceeds by materializing a node encoding the predictor features (which must be less recent than the most recent tag set), then traversing the graph through the next_most_recent_t edge for an arbitrary tag t.
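A sketch of this consequent (confidence) scoring over document sets, with hypothetical counts chosen to mirror the Senior Buyer example in Figure 4:

def confidence(fg_docs, candidate_docs):
    # c(vi, vj) = |DFG ∩ Dvj| / |DFG|: the confidence of the rule
    # (predictor features) -> candidate
    return len(fg_docs & candidate_docs) / len(fg_docs)

# E.g., if 200 histories match the predictor features and 18 of them list
# "Senior Buyer" as the next title, the confidence is 18/200 = 0.09.
print(confidence(set(range(200)), set(range(18))))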

Results: Using data on what career paths thousands of other job seekers have taken, we can answer the question "Given my current position and skills, what are my next most likely positions?" Figure 4 demonstrates an example answer.

The most direct application of this predictive capability is in providing recommendations for job seekers looking to take the next step in their careers. As of the time of writing, such a system was still in a research phase. Our approach relies on the dynamic edge materialization of the SKG to discover viable job titles, which are then filtered by compensation (and in the future, experience) constraints to ensure recommended jobs represent a step forward.

Search Expansion Use Case: Recruiter search expansion represents another application of the SKG's predictive capabilities. A common problem when recruiting for high-demand jobs is a scarcity of applicants. Searching for applicants who match the skills and title of an in-demand job may be too restrictive, but recruiters don't always have the domain knowledge to expand their search for fitting candidates. A semantic search engine represents an adequate solution to this problem, but without the concept of career progression, semantic search ignores a pool of trainable candidates. Given an original candidate search, q0, we consider the problem of expanding the query while retaining the highest possible probability that the additional candidates represent a "good fit". One measure of trainability is the probability that the candidate would advance to match q0 in their next job independent of the recruiter. By this definition, our problem can be reduced to finding a maximal confidence antecedent of a given starting node.

Fig. 5. Predictive analytics (antecedent scoring). An example of query expansion for a q0 of skills:Java*. The nodes joined by edges in the middle and right columns form the combined antecedent, with the query result set forming the consequent. The top number on the rightmost nodes equals the confidence of the combined antecedent → starting node rule, while the top number for the middle column represents the confidence of the single-item antecedent → starting node rule. Correspondingly, the bottom number indicates the support (times one million) for each rule.

Experiment Setup: We tested the search expansion capabilities of the SKG using the same career path corpus containing one million resume examples. We modify our scoring function again, to an antecedent scoring function, which evaluates the confidence of a rule defined as Vk → v1, where v1 is the starting node and Vk is the set of nodes traversed up to index k (excluding v1). Given a path P = v1, v2, ..., vi:

a(vi, vk) = |Dvk ∩ DFG| / |DBG|                        if vi is a starting node
a(vi, vk) = |Dvk ∩ DFG| / |(⋂_{j=2..i} Dvj) ∩ DBG|     otherwise

In order to isolate the effect of career progression, we modified our materialized starting node by explicitly excluding examples that matched the query in earlier employment history entries. We then traverse the graph along the has_less_recent_t edge for an arbitrary tag type t.

Results: Figure 5 shows the results for an example query. Note the relatively high confidences for the distantly related job titles and skills, which are unlikely to be returned by a semantic search engine. Although applications for this use case are still in development, our approach would be to use the SKG to generate expansions, which can then be selectively applied based on confidence and support thresholds.

C. Document Summarization

Another interesting application of the SKG is the identification of the most important topics within a document. In any given text document, some words are going to be highly related to the topic of the document, while others will be unimportant. With the SKG, we can score every entity within a document to determine its significance to the topic of the document. This takes us from a full text document to a much more compressed summarization of the document including only its most important components.

Experiment Setup: We indexed 3 million job posting documents into the SKG implementation discussed in the System Implementation section.

Request:

Job Title: Big Data Engineer

REQUIREMENTS:
Bachelor's degree in Computer Science or related discipline...
2+ years of hands-on implementation experience (preferably lead engineer) working with a combination of the following technologies: Hadoop, Map Reduce, Pig, Hive, Impala, ...

IDEAL ADDITIONAL EXPERIENCE:
Strong knowledge of noSQL of at-least one noSQL database like HBase and Cassandra.
3+ years' programming/scripting languages Java and Scala, python, R, Pig
2+ years' experience with spring framework
Experience in developing the full life-cycle of a Hadoop solution. This includes creating the requirements analysis, design of the technical architecture, design of the application design and development, testing, and deployment of the proposed solution...
Understanding of Machine Learning skills (like Mahout)
Experience with Visualization Tools such as Tableau...

Response:

data engineer     0.96
hive              0.82
pig               0.82
hadoop            0.8
mapreduce         0.79
nosql             0.71
hbase             0.66
impala            0.6
python            0.56
cassandra         0.56
scala             0.56
machine learning  0.49
tableau           0.39
mahout            0.37
analytics         0.36
java              0.36

TABLE IV. DOCUMENT SUMMARIZATION

Our goal was to then take a new document not already represented in the graph and to have the graph score how related each of the entities found within the new document is to the document itself. While we could have used the individual keywords within the document as our starting nodes, we instead employed an entity extractor on the document as a preprocessing step. The purpose of leveraging the entity extractor was so that we could work with phrases as our nodes (e.g. senior software engineer and registered nurse) as opposed to only single keywords (e.g. senior, software, engineer, registered, and nurse).

Our next step was to specify a foreground query (which yields the set of documents DFG from our model), which in this case should represent the topic of our document. Because our corpus was composed of job posting documents, which have job titles and descriptions, we were able to simply leverage the job title of the document as our foreground query. In other scenarios where no category for the document is known a priori, it is possible to instead leverage other statistics from the terms-docs inverted index, such as the tf-idf score of each term within the document, to find the set of most globally interesting terms within the document [22]. This list of terms can then be used to materialize a foreground query that combines the top most globally interesting terms found within the document.

Once we generate our foreground query (the topic of the document), we then send each of the phrases from the document to the SKG, asking the graph to score how relevant they are to the topic of the document.
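Putting these steps together, the summarization flow can be sketched as follows (illustrative only: docs_for maps each extracted phrase to its document set, and relatedness_score is the z-score-based function sketched in the Scoring Semantic Relationships section):

def summarize(document_phrases, foreground_docs, docs_for, bg_size, top_k=15):
    # score each extracted phrase against the foreground (topic) document set
    scored = [(phrase, relatedness_score(foreground_docs, docs_for[phrase], bg_size))
              for phrase in document_phrases if phrase in docs_for]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]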

Table IV demonstrates an example document run through the SKG. While parts of the document were omitted for the sake of space, you can see that the top scoring nodes returned from the SKG are an excellent summarization representing the most important entities within the document. If someone wanted a quick overview of what this job posting is about, reviewing this ranked list of phrases would provide a very condensed summary. Furthermore, one could request additional traversals from this summary list and find the most related other nodes which were not actually present in the original documents.

Such document summarization has many applications. The summarized list of nodes can be sent to an information retrieval engine to build a document-based query, which creates a form of content-based recommendation engine. You could use the weights of each node to highlight the important sections of a document, or use the next traversal to suggest additional related terms for a document as its author is writing it.

VI. FUTURE WORK

While we have designed the SKG model, created and open sourced a reference implementation, and tested several use cases, there are many extension points worthy of future research and exploration.

Our implementation of the SKG is able both to identify predefined nodes and to materialize new nodes and edges on the fly. While we have implemented two general-purpose scoring mechanisms for assigning weights to edges (popularity and relatedness) and have implemented another more specialized scoring mechanism described in the career pathing use case section, each edge scoring algorithm must be coded into the system today. A future extension of our implementation is to allow users to specify functions as part of the graph query syntax such that they can apply arbitrarily complex scoring calculations without the need to write custom code.

Additionally, whereas today all documents matching a term or query are included in materialized nodes and edges (even if they are only tangentially related to the document in which they were found), we believe that by first scoring all documents matching each node (for example, using a tf-idf score) and only leveraging the top n documents from each node when scoring, we could further improve the system's ability to identify highly-related nodes and to filter out noisy edges.

Semantic Search: One of the key future use cases where we intend to apply the SKG model is query interpretation and expansion. We have already shown that the SKG works quite well for identifying the most conceptually similar terms for a given term/phrase/entity within a domain. This can be used to automatically expand a search query to perform a conceptual search instead of an exact match search. For example, if someone searches for cdl, then the query could be automatically expanded to something like cdl OR "truck driver" OR freight OR "commercial drivers license". One particularly interesting aspect of using the SKG for this task is that not only can it identify what each individual search term means, but it can actually identify which terms are semantically related to the entire query. Thus, if someone searches for driver AND windows, the SKG can return a different set of keyword expansions for the term driver than if the user had searched for driver AND truck. This deviates from traditional taxonomy approaches, which often rely on fixed meanings of each word, whereas the SKG can infuse nuanced, contextual interpretation of search terms.

Search Engine Relevancy Algorithms: There are also several options for leveraging the technique mentioned in the document summarization use case to calculate and index the significance of each term in each document and use that information as part of the search engine's relevancy ranking algorithm. Such a probabilistic relevancy ranking function could likely achieve measurable gains over more traditional models which only consider the number of occurrences of terms as opposed to their conceptual significance to the document.

Trending Topics (time-series): Another application of the SKG is the identification of trends over time. This could be conceptually described as anomaly detection in which the foreground and background sets are time frames as opposed to categories or keywords. For example, if you were analyzing a news feed or stream of social media posts, you could specify a background query of "this month" and a foreground query of "this week" or "today" to see articles or categories which are occurring with a higher-than-expected likelihood. This would allow you to identify trending topics, and is a useful additional use case for future exploration with the SKG model.
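A minimal sketch of this time-windowed scoring follows, reusing the z-score formulation from the scorer sketch above; the docs layout, mapping ids to a date and a set of terms, is hypothetical:

    import math
    from datetime import timedelta

    def zscore(fg, bg, node):
        if not fg or not bg:
            return 0.0
        n, y = len(fg), len(fg & node)
        p = len(bg & node) / len(bg)
        if p in (0.0, 1.0):
            return 0.0
        return (y - n * p) / math.sqrt(n * p * (1 - p))

    def trending_terms(docs, today, vocab, k=10):
        # Background = this month's documents, foreground = this week's;
        # the highest-scoring terms are over-represented this week.
        month = {d for d, rec in docs.items()
                 if rec["date"] >= today - timedelta(days=30)}
        week = {d for d, rec in docs.items()
                if rec["date"] >= today - timedelta(days=7)}
        scores = {t: zscore(week, month,
                            {d for d in month if t in docs[d]["terms"]})
                  for t in vocab}
        return sorted(scores, key=scores.get, reverse=True)[:k]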

Recommendation Systems: Most recommendation systems leverage behavioral-based collaborative filtering, which suffers from the cold-start problem (items which have not yet been reviewed by enough users will not be recommended). To overcome this, it is often helpful to also have a content-based recommendation approach. The SKG can be leveraged to identify the most significant features of a document (as previously described in the document summarization use case) in order to match other documents sharing those same features. The SKG can also be leveraged to better understand the users for whom recommendations are being made, by inspecting the known information about them from their previous interactions with the system as compared with other users. For example, if a user has previously run multiple searches within a search engine, the SKG can be used to determine the intersection and overlap between those searches (a materialized node) and to traverse to the other nodes that are most related to the combined search history of the user. Further, the SKG could be used to predict the interests of users based upon how their behavior compares to that of other users. The career pathing use case described previously is a good example of this, where we could inspect a job seeker's employment history and current job searches to determine, based upon other job seekers' typical behavior, when the user is most likely to switch jobs, and to what kind of job he/she would be willing to switch. This information could then be used as a feature in the job recommendation algorithm, and this feature would also change to more heavily favor a progression in job seniority as time progresses. Depending upon the domain, there are numerous ways to exploit the SKG's ability to materialize and score the nuanced relationships between arbitrary entities.
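The search-history idea reduces to one more set intersection, sketched below; postings and skg_related_terms are the same hypothetical structures used in the expansion sketch above:

    def user_interest_terms(user_searches, postings, k=10):
        # Each prior search matches a set of documents; the intersection
        # of those sets is a materialized node representing the user's
        # combined search history, from which we traverse to related terms.
        matched = [set().union(*(postings[t] for t in query))
                   for query in user_searches]
        combined = set.intersection(*matched)
        return skg_related_terms(combined, k)   # hypothetical traversal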

Root Cause Analysis: The SKG is a good candidate for future research as a root cause analysis tool. Many companies maintain ticketing systems or online help forums through which they receive questions, bug reports, and/or complaints. The SKG could be used, for example, to find posts by customers matching specific criteria (e.g., nodes/terms like frustrated, broken, or refund) and to find out which other words or topics are most statistically related. If a company sold a software system, this would be an easy way to determine which parts of the system needed the most attention.

Abuse Detection: Given a system where a few users partake in abusive behavior (posting spam, programmatically crawling the website, etc.), you could index the behaviors of users, find some abusive users, and use them as your foreground set to find other users who exhibit statistically similar behavior. In the spam use case, you could also tag your original documents as spam or not spam and set the spam documents as your foreground set D_FG. Assuming your documents contain textual content, you could then identify nodes/terms more commonly found among spam postings, and use this detection as the basis of a spam classifier for new postings as they are received. This is one of many forms of anomaly detection, a category of use cases for which the SKG is particularly well suited for future research and applications.
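For the spam variant, the foreground/background machinery applies directly, as in this sketch; docs maps ids to term sets and labels maps ids to a spam flag, both hypothetical, and zscore is the function from the trending sketch above:

    def spam_indicator_terms(docs, labels, vocab, k=25):
        # Foreground = documents labeled spam; background = all documents.
        # The top-scoring terms are those most over-represented in spam,
        # usable as features for a downstream classifier.
        bg = set(docs)
        fg = {d for d in docs if labels[d]}
        scores = {t: zscore(fg, bg, {d for d in docs if t in docs[d]})
                  for t in vocab}
        return sorted(scores, key=scores.get, reverse=True)[:k]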

VII. CONCLUSION

In this paper, we have introduced a novel kind of knowledge discovery model, which we are calling the Semantic Knowledge Graph. This system enables the automatic creation of a graph which encodes statistical relationships between all keywords, phrases, and entities represented within free-text and semi-structured documents, allowing those relationships to be traversed and scored based upon the strength of each relationship within a specific domain. This auto-generated graph can then be queried in real time to discover the nuanced relationships between any combination of linguistic representations (keywords, phrases, etc.) or structured data (titles, categories, dates, numbers, etc.). Unlike traditional graph databases, which perform either a depth-first or a breadth-first search across all nodes, the Semantic Knowledge Graph materializes edges between nodes and assigns their weights on the fly, based upon either the count of overlapping documents or the relatedness of nodes within a corpus of documents, and can use these weights to traverse only the top-scoring edges. This turns the graph traversal process into one index lookup and set intersection per level of depth of the traversal, making the Semantic Knowledge Graph highly efficient at traversing millions or even billions of nodes, as long as only the highest-weighted nodes are collected at each level.

The Semantic Knowledge Graph has numerous applications, including automatically building ontologies, identification of trending topics over time, predictive analytics on time-series data, root-cause analysis surfacing concepts related to failure scenarios from free text, data cleansing, document summarization, semantic search interpretation and expansion of queries, recommendation systems, and numerous other forms of anomaly detection. The main contribution of this paper is the introduction of the Semantic Knowledge Graph, a novel and compact graph model that can dynamically materialize and score the relationships between any entities represented within a corpus of documents.

ACKNOWLEDGMENT

The authors would like to thank CareerBuilder for their sponsorship of this research and development. In particular, the authors thank Daniel Crouch, David Bernal, Lamar Payson, and Jacob Maggio, who also contributed their ideas and code reviews during the development of the semantic knowledge graph, as well as Colin Field, Matt McNair, Eric Presley, and Abdel Tefridj for their support of the development and open sourcing of the referenced implementation.
