The Semantic Knowledge Graph

  • Published on
    07-Jan-2017

  • View
    458

  • Download
    0

Embed Size (px)

Transcript

PowerPoint Presentation

The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain

Trey GraingerSVP of EngineeringLucidworksKhalifeh AlJaddaLead Data ScientistCareerBuilderMohammed KorayemData ScientistCareerBuilderAndries SmithSoftware EngineerCareerBuilder

AbstractThis paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.

Trey Grainger SVP of Engineering

Previously Director of Engineering @ CareerBuilderMBA, Management of Technology Georgia TechBA, Computer Science, Business, & Philosophy Furman UniversityInformation Retrieval & Web Search - Stanford University

Fun outside of CB: Co-author of Solr in Action, plus a handful of research papersFrequent conference speakerFounder of Celiaccess.com, the gluten-free search engineLucene/Solr contributor

About Me

Terminology / Background

A Graph DSAA 2016 Montreal Quebec Canada

Semantic Knowledge Graph PaperTrey Grainger

Mohammed Korayem

Andries Smith

Khalifeh AlJaddapresenter_atauthor_ofauthor_ofauthor_ofauthor_ofin_provincein_cityin_country

Node / Vertex

Edge

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene.

Key Solr Features:

Multilingual Keyword searchRelevancy Ranking of resultsFaceting & AnalyticsHighlightingSpelling CorrectionAutocomplete/Type-ahead PredictionSorting, Grouping, DeduplicationDistributed, Fault-tolerant, ScalableGeospatial searchComplex Function queriesRecommendations (More Like This) many more

*source: Solr in Action, chapter 2

TermDocumentsadoc1 [2x]browndoc3 [1x] , doc5 [1x]catdoc4 [1x]cowdoc2 [1x] , doc5 [1x]...oncedoc1 [1x], doc5 [1x]overdoc2 [1x], doc3 [1x]thedoc2 [2x], doc3 [2x], doc4[2x], doc5 [1x]

DocumentContent Fielddoc1 once upon a time, in a land far, far awaydoc2the cow jumped over the moon.doc3 the quick brown fox jumped over the lazy dog.doc4the cat in the hatdoc5The brown cow said moo once.

What you SEND to Lucene/Solr:How the content is INDEXED into Lucene/Solr (conceptually):The inverted index

/solr/select/?q=apache solrTermDocumentsapachedoc1, doc3, doc4, doc5hadoopdoc2, doc4, doc6solrdoc1, doc3, doc4, doc7, doc8

doc5doc7 doc8doc1 doc3 doc4solrapacheapache solrMatching queries to documents

Related Work

Knowledge GraphRelated WorkPrimarily related to ontology Learning.

Recently, large-scale knowledge bases that utilize ontologies (FreeBase [4], DBpedia [5], and YAGO [6, 7]) have been constructed using structured sources such as Wikipedia infoboxes.

Other approaches (DeepDive [8], Nell2RDF [9], and PROSPERA [10]) crawl the web and use machine learning and natural language processing to build web-scale knowledge graphs.

Problem Description

Knowledge GraphChallenges we are solvingBecause current knowledge bases / ontology learning systems typically requires explicitly modeling nodes and edges into a graph ahead of time, this unfortunately presents several limitations to the use of such a knowledge graph:

Entities not modeled explicitly as nodes have no known relationships to any other entities.

Edges exist between nodes, but not between arbitrary combinations of nodes, and therefore such a graph is not ideal for representing nuanced meanings of an entity when appearing within different contexts, as is common within natural language.

Substantial meaning is encoded in the linguistic representation of the domain that is lost when the underlying textual representation is not preserved: phrases, interaction of concepts through actions (i.e. verbs), positional ordering of entities and the phrases containing those entities, variations in spelling and other representations of entities, the use of adjectives to modify entities to represent more complex concepts, and aggregate frequencies of occurrence for different representations of entities relative to other representations.

It can be an arduous process to create robust ontologies, map a domain into a graph representing those ontologies, and ensure the generated graph is compact, accurate, comprehensive, and kept up to date.

Knowledge GraphSemantic Data Encoded into Free Text Content

Model

id: 1job_title: Software Engineerdesc: software engineer at a great companyskills: .Net, C#, javaid: 2job_title: Registered Nursedesc: a registered nurse at hospital doing hard workskills: oncology, phlebotemyid: 3job_title: Java Developerdesc: a software engineer or a java engineer doing workskills: java, scala, hibernatefieldtermpostings listdocposdesca142131, 5at1324company16doing2638engineer1233, 7great15hard27hospital25java36nurse23or34registered22software1132work21039job_titlejava developer31

fielddoctermdesc1aatcompanyengineergreatsoftware2aatdoinghardhospitalnurseregisteredwork3adoingengineerjavaorsoftwareworkjob_title1Software Engineer

Terms-Docs Inverted IndexDocs-Terms Uninverted IndexDocuments

Knowledge Graph

Knowledge GraphSet-theory ViewGraph ViewHow the Graph Traversal Worksskill: Javaskill: Scalaskill: Hibernateskill: Oncologyhas_related_skillhas_related_skillhas_related_skilldoc 1doc 2doc 3doc 4doc 5doc 6skill: Javaskill: Javaskill: Scalaskill: Hibernateskill: OncologyData Structure View

JavaScalaHibernatedocs1, 2, 6docs 3, 4

Oncologydoc 5

Knowledge GraphGraph Model

Structure:Single-level Traversal / Scoring:Multi-level Traversal / Scoring:

Knowledge GraphMulti-level TraversalData Structure ViewGraph Viewdoc 1doc 2doc 3doc 4doc 5doc 6skill: Javaskill: Javaskill: Scalaskill: Hibernateskill: Oncologydoc 1doc 2doc 3doc 4doc 5doc 6job_title: Software Engineerjob_title: Data Scientistjob_title: Java DeveloperInverted Index LookupDoc Values Index LookupDoc Values Index Lookup

Inverted Index Lookup

JavaJava DeveloperHibernateScalaSoftware EngineerData Scientisthas_related_skillhas_related_skillhas_related_skillhas_related_job_titlehas_related_job_titlehas_related_job_titlehas_related_job_titlehas_related_job_titlehas_related_job_title

Knowledge GraphScoring nodes in the GraphForeground vs. Background AnalysisEvery term scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context. countFG(x) - totalDocsFG * probBG(x) z = -------------------------------------------------------- sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))

{ "type":"keywords, "values":[ { "value":"hive", "relatedness": 0.9765, "popularity":369 }, { "value":spark", "relatedness": 0.9634, "popularity":15653 }, { "value":".net", "relatedness": 0.5417, "popularity":17683 }, { "value":"bogus_word", "relatedness": 0.0, "popularity":0 }, { "value":"teaching", "relatedness": -0.1510, "popularity":9923 }, { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }+-Foreground Query: "Hadoop"

Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain. DSAA 2016.

Knowledge GraphMulti-level Graph Traversal with Scores

software engineer*(materialized node)JavaC#.NET.NET DeveloperJava DeveloperHibernateScalaVB.NETSoftware EngineerData ScientistSkillNodeshas_related_skillStartingNodeSkillNodeshas_related_skillJob TitleNodeshas_related_job_title0.900.880.930.930.340.740.910.890.740.890.780.720.480.930.760.830.800.640.610.780.55

Knowledge GraphMaterialization of new nodes through shared documents

Implementation

Open Sourced!

Knowledge GraphPopulating the Graph

Knowledge Graph

Knowledge Graph

Experiments

Knowledge GraphData Cleansing

{ "type":"keywords, "values":[ { "value":"hive", "relatedness": 0.9765, "popularity":369 }, { "value":spark", "relatedness": 0.9634, "popularity":15653 }, { "value":".net", "relatedness": 0.5417, "popularity":17683 }, { "value":"bogus_word", "relatedness": 0.0, "popularity":0 }, { "value":"teaching", "relatedness": -0.1510, "popularity":9923 }, { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }Foreground Query: "Hadoop"Experiment: Data analyst manually annotated 500 pairs of terms found together in real query logs as relevant or not relevant

Results: SKG removed 78% of the terms while maintaining a 95% accuracy at removing the correct noisy pairs from the input data.

Knowledge GraphPredictive Analytics

Knowledge GraphSearch ExpansionExperiment: Take an initial query, and expand keyword phrases to include the most related entities to that query

Example:

The Semantic Search ProblemUsers Query:

machine learning research and development Portland, OR software engineer AND hadoop, java

Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)

Semantic Query Parsing:"machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java

Semantically Expanded Query:("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")AND ("research and development"^10 OR "r&d") AND AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)

machine learningKeywords:Search Behavior,Application Behavior, etc.Job Title Classifier, Skills Extractor, Job Level Classifier, etc.Semantic Interpretationkeywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55)) }{ BOOST_TO_TOP: ( job_title:("software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }

Modified Query:

Related Occupationsmachine learning: {15-1031.00 .58Computer Software Engineers, Applications15-1011.00 .55Computer and Information Scientists, Research15-1032.00 .52 Computer Software Engineers, Systems Software }

machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2, }Common Job TitlesQuery Expansion

Related Phrasesmachine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }

Known keyword phrasesjava developermachine learningregistered nurse

FST

Knowledge Graph in

+

Knowledge GraphDocument SummarizationExperiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG.

Additionally, can traverse the graph to related entities/keyword phrases NOT found in the original document

Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring

Document Enrichment Find / Score Relationships

Document Summarization Rank / Clean Keywords

Knowledge GraphFuture WorkSemantic Search (more experiments)Search Engine Relevancy AlgorithmsTrending TopicsRecommendation SystemsRoot Cause AnalysisAbuse Detection

Knowledge GraphConclusionApplications:The Semantic Knowledge Graph has numerous applications, including automatically building ontologies, identification of trending topics over time, predictive analytics on timeseries data, root-cause analysis surfacing concepts related to failure scenarios from free text, data cleansing, document summarization, semantic search interpretation and expansion of queries, recommendation systems, and numerous other forms of anomaly detection.

Main contribution of this paper: The introduction (and open sourcing) of the the Semantic Knowledge Graph, a novel and compact new graph model that can dynamically materialize and score the relationships between any arbitrary combination of entities represented within a corpus of documents.

Knowledge GraphReferences

Contact InfoTrey Grainger trey.grainger@lucidworks.com @treygrainger

http://solrinaction.com

Other presentations: http://www.treygrainger.com

Recommended

View more >