Enriching knowledge graphs with text processing techniques
J.-M. Le Goff, CERN
A. Rattinger, CERN & Graz University of Technology
I-Know, Graz, Austria 12 October 2017
ERCIM News 111: https://ercim-news.ercim.eu/en111/r-i/collaboration-spotting-a-visual-analytics-platform-to-assist-knowledge-discovery
Agenda
• Our approach:
  • Building knowledge graph and graph data model
  • Identifying concepts and relationships in full text
  • Enriching knowledge graph
• Text mining for graph extensions (preliminary)
• Use case: patents and publications analytics
iKnow2017 2
[Figure: Heterogeneous data sources, Text, Data model]
Why build a knowledge graph out of datasets?
• Datasets contain valuable information for visual analytics
  • Businesses, applications, domains, etc.
• Datasets are difficult to use directly for visual analytics
  • They contain complex structures with various data types
  • They come with different data representations
    • Structured, semi-structured and unstructured
  • Only subsets are of interest for a particular analysis
  • Domain-specific information is needed to understand analytics output
• Build a data network with the elements of interest in the datasets
  • Vertices: data instances with labelled data types
  • Relationships: interconnectivity
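As an illustration of the last bullet, a data network of labelled vertices and relationships can be sketched in a few lines. This is a minimal in-memory sketch, not the platform's storage layer; the class names, vertex keys and labels below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vertex:
    """A data instance carrying one label (its data type)."""
    key: str
    label: str


@dataclass
class DataNetwork:
    """Minimal in-memory graph: labelled vertices plus named relationships."""
    vertices: dict = field(default_factory=dict)  # key -> Vertex
    edges: set = field(default_factory=set)       # (source key, relationship, target key)

    def add_vertex(self, key, label):
        self.vertices[key] = Vertex(key, label)

    def relate(self, source, relationship, target):
        # Both endpoints must already exist, so every edge links known instances.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.add((source, relationship, target))

    def neighbours(self, key, relationship=None):
        """Return the keys reachable from `key`, optionally filtered by relationship."""
        return sorted(t for s, r, t in self.edges
                      if s == key and (relationship is None or r == relationship))


# Hypothetical instances, for illustration only.
net = DataNetwork()
net.add_vertex("pat-1", "Patent")
net.add_vertex("kw-tsv", "Keyword")
net.add_vertex("org-x", "Organisation")
net.relate("pat-1", "has", "kw-tsv")
net.relate("pat-1", "has", "org-x")
print(net.neighbours("pat-1", "has"))
```

Only the elements of interest are added as vertices; everything else in the source dataset stays out of the network.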
Data network stored as a graph
• Complexity
• Interconnectivity
• Scalability
• Multi-dimensionality
Graphs are natural representations of large and interconnected networks
• Node and relationship labels
• Compact graph structure
• Graph query language
• No need for schema evolution
Data model is embedded in the graph itself
• Data model: labels and relationships
• Labels → Graph dimensions
• Relationships → Interconnectivity between labels
Graphs of connected elements constitute multi-dimensional networks
Data Network = Knowledge graph
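Because the data model is embedded in the graph, it can in principle be read back from the instances. The sketch below collapses a small instance graph into its graph of labels; the function name and the sample vertices are hypothetical, and a graph database would normally expose this through its own query language.

```python
def derive_schema(vertex_labels, edges):
    """Collapse an instance graph into its label graph (the data model).

    vertex_labels: dict mapping vertex id -> label
    edges: iterable of (source id, relationship, target id)
    Returns the set of (source label, relationship, target label) triples.
    """
    return {(vertex_labels[s], rel, vertex_labels[t]) for s, rel, t in edges}


# Hypothetical instance data.
vertex_labels = {
    "pub-1": "Publication", "pub-2": "Publication",
    "kw-a": "Keyword", "kw-b": "Keyword",
    "org-1": "Organisation",
}
edges = [
    ("pub-1", "has", "kw-a"),
    ("pub-2", "has", "kw-a"),
    ("pub-2", "has", "kw-b"),
    ("pub-1", "has", "org-1"),
]

schema = derive_schema(vertex_labels, edges)
for triple in sorted(schema):
    print(triple)
```

Four instance-level relationships reduce to two schema-level ones, which is why adding new labels or relationships needs no separate schema migration.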
[Figure: a knowledge graph (graph of data instances and relationships) shown next to its data model (graph data model: schema)]
Data model embedded in knowledge graph
[Figure: data sources 1 … n are turned into the Knowledge Graph through Processing, Populating, Organising and Labelling; the Knowledge Graph feeds Visual analytics]
Visual analytics is performed on the network using its schema
Labelling
• Labelling Vertices
  • Semi-structured data:
    • Metadata → Structural information
    • Tags → labels
  • Structured data:
    • Relational databases → tables, fields → labels
  • Text processing to create new labels, new vertices
• Labelling Relationships
  • Semi-structured data:
    • Relationships from nested tags (Has, isPartOf, etc.)
  • Structured data:
    • Relational databases → no labels in E-R models
  • Vertex labels + text information to label relationships
    • Ex: IsA, send, receive, live, own, etc.
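The semi-structured case above (tags become labels, nesting becomes relationships) can be sketched as a walk over a nested record. This is an illustrative sketch only: the function name, the record and its tag names are hypothetical, and every nesting level is labelled with a generic "has" relationship.

```python
from itertools import count


def label_from_nested(record, root_label):
    """Walk a nested dict and return (vertices, relationships).

    vertices: list of (id, label, value), where the label is the enclosing tag
    relationships: list of (parent id, "has", child id) derived from nesting
    """
    ids = count(1)
    vertices, relationships = [], []

    def visit(node, label, parent_id):
        node_id = next(ids)
        value = None if isinstance(node, dict) else node
        vertices.append((node_id, label, value))
        if parent_id is not None:
            relationships.append((parent_id, "has", node_id))
        if isinstance(node, dict):
            for tag, child in node.items():
                # A list under one tag yields one vertex per item, all sharing the tag as label.
                children = child if isinstance(child, list) else [child]
                for item in children:
                    visit(item, tag, node_id)

    visit(record, root_label, None)
    return vertices, relationships


# Hypothetical metadata record.
record = {
    "Title": "A TSV process",
    "Keyword": ["TSV", "3D integration"],
    "Organisation": {"City": "Graz"},
}
vertices, relationships = label_from_nested(record, "PubItem")
for v in vertices:
    print(v)
for r in relationships:
    print(r)
```

Relationship names richer than "has" (IsA, isLocated, and so on) would need the vertex labels and text information mentioned in the last bullets.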
Ex: Publications/Patents Metadata
• Published Items
  • Publications: (Scopus, WoK, etc.)
    • Organisation address (in data)
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents: (PatStat, etc.)
    • Organisation address (in data)
    • Category (in data)
      • Patent class
Data Model from Metadata tags
[Figure: document metadata mapped to a data model (graph of labels). Nodes: PubItem:Pub, PubItem:Pat, Org: Address, KW, Cat: Scat, Cat: PatClas; edges labelled "has"]
Legend: Kw: Keyword, PubItem: Published Item, OrgAdd: Organisation Address
Publications/Patents Metadata
• Published Items
  • Publications:
    • Organisation address → Text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents:
    • Organisation address → Text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Category (in data)
      • Patent class
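The text-processing step on organisation addresses can be pictured as follows. This is a deliberately naive sketch: it assumes a comma-separated address with the organisation first and the city and country last, which real address strings rarely guarantee. The function names, the sample address and the choice of "has" and "isLocated" edges are assumptions for illustration.

```python
def split_address(address):
    """Split 'Organisation, ..., City, Country' into its graph-relevant parts.

    Assumes comma-separated fields with the organisation first and the
    city and country last; anything in between is ignored.
    """
    parts = [p.strip() for p in address.split(",") if p.strip()]
    if len(parts) < 3:
        raise ValueError(f"cannot split address: {address!r}")
    return {"Organisation": parts[0], "City": parts[-2], "Country": parts[-1]}


def address_triples(address):
    """Turn one address string into (source, relationship, target) triples."""
    fields = split_address(address)
    return [
        (address, "has", fields["Organisation"]),
        (fields["Organisation"], "isLocated", fields["City"]),
        (fields["Organisation"], "isLocated", fields["Country"]),
    ]


# Hypothetical address string.
sample = "Example Institute, 1 Main Street, Graz, Austria"
print(split_address(sample))
for triple in address_triples(sample):
    print(triple)
```

Deciding whether an organisation is a company or an institute is not attempted here; per the slide, that information comes from other data sources.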
Exploiting text information
[Figure: document metadata mapped to an extended data model (graph of labels). Nodes: PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, Cat: Scat, KW, Org Address, Org: Comp, Org: Inst, Cty, Cny; edges labelled "has", "isa" and "isLocated"]
Legend: Kw: Keyword, Org: Organisation, Inst: Institute, Comp: Company, Cny: Country, Cty: City, OrgAdd: Organisation Address
Data sources:
- Patent & publications metadata
- Patent full text (USPTO)
Preliminary work
Publications/Patents analytics
• Use case: Who are the key organisations active in a particular technology?
• Motivations
  • Technology monitoring
    • How an emerging technology is evolving (Research → Industry)
  • Foresight studies
  • Looking for partners, joining collaborations
  • Company, institution landscape
Publications/Patents analytics (2)
• Use case: What is the organisation landscape of a technology?
• Technique:
  • Search for pub/pat matching “technology terms”
    • Titles and/or abstracts
• Issues
  • Quality of the “technology terms” to identify a technology
  • Search terms may not correspond to a single technology
  • Some pertinent publications and patents may not contain the “technology terms”
• Use text processing to address these issues
Add search output to knowledge graph
Search output: a subset of publications and patents matching “technology terms”
[Figure: the graph of labels with an added "Search" node. Nodes: Search, PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, Cat: Scat, KW, Org Address, Org: Comp, Org: Inst, Cty, Cny; edges labelled "has", "isa" and "isLocated"]
Illustrating the approach: Through Silicon Via
• Search terms on titles of pub/pat:
  • “Through Silicon Via”
    • Exact matching → High-quality output
  • “TSV”
    • More output, but of lower quality
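The trade-off between the two search terms can be reproduced on a toy list of titles. The titles below are invented for illustration, and the matching is a plain case-insensitive whole-phrase search rather than the search engine used in the talk.

```python
import re

# Invented titles; the third one uses "TSV" for Taura Syndrome Virus.
titles = [
    "Through silicon via with improved reliability",
    "Method of forming a TSV in a stacked die",
    "Detection of TSV in farmed shrimp",
    "Copper filling for through-silicon vias",
]


def search(titles, term):
    """Return the titles containing `term` as a whole phrase, ignoring case."""
    rx = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [t for t in titles if rx.search(t)]


phrase_hits = search(titles, "Through Silicon Via")
acronym_hits = search(titles, "TSV")
print(phrase_hits)
print(acronym_hits)
```

The exact phrase returns a single on-topic title but misses the hyphenated, plural variant; the acronym returns more titles, one of which belongs to an unrelated field.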
Technology: Through Silicon Via (TSV) (source: Wikipedia)
Keywords: Through Silicon Via (title)
Keywords: TSV (title)
TSV also means: Taura Syndrome Virus
Methodology
Preliminary work
Approach
• Index patents
• Patent-specific preprocessing
• Create document embedding / feature vectors (Doc2Vec)
• Dimensionality reduction / manifold learning (t-SNE, LargeVis)
• Visualization (Datashader, large-scale visualization)
Index Patents / Preprocessing
• Current dataset: USPTO patents from 2006-2014
• Candidate generation: patents are indexed for fuzzy search (Lucene)
• Preprocessing
  • Clean HTML syntax
  • Remove references, stopwords
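The preprocessing bullets can be sketched with the standard library alone. This is a rough sketch under stated assumptions: tags are stripped with a crude regular expression, "references" are taken to be bracketed numbers such as [3], and the stopword list is a tiny illustrative one. None of this is confirmed as the authors' actual pipeline.

```python
import re
from html import unescape

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def preprocess(raw_html):
    """Strip tags and entities, drop bracketed references, lowercase, remove stopwords."""
    text = re.sub(r"<[^>]+>", " ", raw_html)   # remove HTML tags (crude, not a full parser)
    text = unescape(text)                       # decode entities such as &amp;
    text = re.sub(r"\[\d+\]", " ", text)       # drop numeric references like [3] (assumed format)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokens = preprocess(
    "<p>A via is etched in the <b>silicon</b> wafer [3] &amp; filled with copper.</p>"
)
print(tokens)
```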
Document Embedding
• Numeric representation of text documents
• Gensim implementation, based on word2vec-CBOW
Concept: Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents.
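To make "numeric representation of text documents" concrete without external dependencies, the sketch below uses plain term-count vectors and cosine similarity. This is a stand-in, not Doc2Vec: it learns nothing and ignores word order, but it shows how documents become vectors that can be compared. The documents and names are invented.

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Represent a token list as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented, already-tokenised documents.
docs = {
    "chip_1": "silicon via etched through wafer".split(),
    "chip_2": "copper filled via in silicon wafer".split(),
    "shrimp": "virus detected in farmed shrimp".split(),
}
vocabulary = sorted({t for tokens in docs.values() for t in tokens})
vectors = {name: tf_vector(tokens, vocabulary) for name, tokens in docs.items()}

print(round(cosine(vectors["chip_1"], vectors["chip_2"]), 3))
print(round(cosine(vectors["chip_1"], vectors["shrimp"]), 3))
```

A learned embedding such as Doc2Vec replaces these sparse counts with dense vectors of fixed size, so that documents sharing no exact terms can still end up close together.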
Dimensionality Reduction
• Our dataset contains 2 million points with 300 dimensions
• Many techniques are costly at this scale (MDS, t-SNE)
Large Scale Visualization
• Datashader
  • Rasterization pipeline
  • Handles large amounts of data
• Patent class labels
  • A: Human necessities
  • B: Performing operations; transporting
  • C: Chemistry; Metallurgy
  • D: Textiles; Paper
  • E: Fixed Constructions
  • F: Mechanical engineering; lighting;
  • G: Physics
  • H: Electricity
Colours correspond to distinct labels (2,022,349 patents)
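The idea behind a rasterization pipeline is to aggregate points into a fixed pixel grid before drawing, so rendering cost depends on the image size rather than on the number of points. The sketch below shows that binning step in plain Python with per-category counts; it does not use Datashader, and the function names and sample points are hypothetical.

```python
from collections import Counter


def rasterize(points, width, height, x_range, y_range):
    """Bin (x, y, category) points into a height x width grid of per-category counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[Counter() for _ in range(width)] for _ in range(height)]
    for x, y, category in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point lies outside the viewport
        # Map coordinates to pixel indices; clamp the upper edge into the last bin.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col][category] += 1
    return grid


def dominant(grid):
    """Reduce each pixel to its most frequent category ('.' when the pixel is empty)."""
    return ["".join(cell.most_common(1)[0][0] if cell else "." for cell in row)
            for row in grid]


# Hypothetical 2-D points tagged with a patent section letter.
points = [(0.1, 0.1, "G"), (0.15, 0.2, "G"), (0.2, 0.1, "H"),
          (0.9, 0.9, "H"), (0.8, 0.95, "H"), (0.5, 0.5, "A")]
grid = rasterize(points, width=2, height=2, x_range=(0, 1), y_range=(0, 1))
print(dominant(grid))
```

Showing only the dominant category per pixel is a simplification chosen here; a colour per category, weighted by the counts, gives the kind of image described on the slide.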
Full text processing
• Patent full text: Title + Abstract + Description
  • USPTO full text patent 2006 – 2015
• Objective
  • 1: Look for pertinent patents that do not contain “Through Silicon Via”
  • 2: Look for different meanings of “TSV”
International Patent Classification (IPC)
• Sections from A (“Human Necessities”) to H (“Electricity”)
• Classes (A01 "Agriculture; forestry; animal husbandry; trapping; fishing")
• Subclasses, Group Number, Subgroup
  • Example: H01L 23/00 (Details of semiconductor or other solid state devices)
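The hierarchy in the example symbol can be made explicit by splitting it into its levels. This is a small sketch; the regular expression covers the common "H01L 23/00" layout and is an assumption, not a full validator of the IPC scheme.

```python
import re
from typing import NamedTuple


class IpcSymbol(NamedTuple):
    section: str     # e.g. "H"
    ipc_class: str   # e.g. "H01"
    subclass: str    # e.g. "H01L"
    main_group: str  # e.g. "23"
    subgroup: str    # e.g. "00"


# Section letter, two-digit class, subclass letter, main group "/" subgroup.
_IPC = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol):
    """Split a full IPC symbol into its hierarchy levels."""
    match = _IPC.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, cls, sub, group, subgroup = match.groups()
    return IpcSymbol(section, section + cls, section + cls + sub, group, subgroup)


parsed = parse_ipc("H01L 23/00")
print(parsed)
```

Truncating at a chosen level (section, class or subclass) gives labels of different granularity for colouring or grouping patents.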
Patent Class G (Physics) Overview
378,692 Patents
Patent Search (TSV or Through Silicon Via)
[Figure panels: Most prominent patent classes (1696 results); Document Embedding]
Comparison of Through Silicon Via vs TSV
• The relevant data differs between the two sets of search results
[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]
Through Silicon Via vs TSV
• Documents relevant to the search for each of the two terms
[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]
Derive new relationships
[Figure: the graph of labels with the "Search" node. Nodes: Search, PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, Cat: Scat, KW, Org Address, Org: Comp, Org: Inst, Cty, Cny; edges labelled "has", "isa" and "isLocated"]
Search extended to publications
• Publications: Titles + Abstracts
• Publications offer a different viewpoint
• Publications are classified according to the n-closest patent classes
• G06F: Electric Digital Data Processing (Through Silicon Via (Antenna))
• A61K: Medical or Veterinary Science (Taura-Syndrom-Virus, tachycardia beat, Tellerspülvermögens)
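Classifying a publication by its n closest patents amounts to a nearest-neighbour vote in the embedding space. The sketch below assumes Euclidean distance and a simple majority vote; the talk does not state which distance or voting rule was used, and the vectors are invented two-dimensional points.

```python
import math
from collections import Counter


def classify_by_neighbours(query, labelled_vectors, n=3):
    """Assign the majority patent class among the n vectors closest to `query`.

    labelled_vectors: list of (vector, patent_class) pairs.
    Distance is Euclidean; on a tie the class seen first among the nearest wins.
    """
    nearest = sorted(labelled_vectors, key=lambda item: math.dist(query, item[0]))[:n]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]


# Invented patent embeddings with their patent classes.
patents = [
    ((0.0, 0.1), "G06F"),
    ((0.2, 0.0), "G06F"),
    ((0.1, 0.3), "H01L"),
    ((5.0, 5.1), "A61K"),
    ((5.2, 4.9), "A61K"),
]

print(classify_by_neighbours((0.1, 0.1), patents))
print(classify_by_neighbours((5.0, 5.0), patents))
```

A publication about the semiconductor sense of "TSV" and one about the virus land in different regions of the space and therefore receive different classes.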
Conclusion
• Networks populated with metadata need to be enriched to properly support Visual Analytics
• Enrichment can come from additional data sources or from text processing on the documents referenced in the metadata
• Preliminary text processing results on patents (title, abstract and description) indicate that it is possible to:
  • Enrich a patent set w.r.t. a search result
  • Regroup patents via patent categories showing different meanings of search terms
  • Link some of the publications with nearby patents
Thank you for your attention!