Enriching knowledge graphs with text processing techniques

J.-M. Le Goff, CERN

A. Rattinger, CERN & Graz University of Technology

I-Know, Graz, Austria, 12 October 2017

ERCIM News 111: https://ercim-news.ercim.eu/en111/r-i/collaboration-spotting-a-visual-analytics-platform-to-assist-knowledge-discovery

Agenda
• Our approach:
  • Building knowledge graph and graph data model
  • Identifying concepts and relationships in full text
  • Enriching knowledge graph
• Text mining for graph extensions (preliminary)
• Use case: patents and publications analytics

iKnow2017 2

[Figure labels: Heterogeneous data sources; Text; Data model]

Why build a knowledge graph out of datasets?

• Datasets contain valuable information for visual analytics
  • Businesses, applications, domains, etc.
• Datasets are difficult to use directly for visual analytics
  • They contain complex structures with various data types
  • They come with different data representations
    • Structured, semi-structured and unstructured
  • Only subsets are of interest for a particular analysis
  • Domain-specific information is needed to understand analytics output
• Build a data network with elements of interest in the datasets
  • Vertices: data instances with labelled data types
  • Relationships: interconnectivity

Data Network = Knowledge graph

Data network stored as a graph
• Complexity
• Interconnectivity
• Scalability
• Multi-dimensionality

Graphs are natural representations of large and interconnected networks
• Node and relationship labels
• Compact graph structure
• Graph query language
• No need for schema evolution

Data model is embedded in the graph itself
• Data model: labels and relationships
  • Labels → graph dimensions
  • Relationships → interconnectivity between labels

Graphs of connected elements constitute multi-dimensional networks
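The claim that the data model is embedded in the graph itself can be illustrated with a small sketch. This is a toy illustration in plain Python, not the authors' implementation: the vertex identifiers, the sample edges and the `derive_schema` helper are hypothetical, and the label names are borrowed from the publication/patent example in this deck.

```python
# Toy knowledge graph: every vertex carries a label, every edge a relationship type.
# All identifiers and data here are hypothetical.
vertices = {
    "pub1": "PubItem",
    "pat1": "PubItem",
    "kw1": "Kw",
    "addr1": "OrgAdd",
}
edges = [
    ("pub1", "has", "kw1"),
    ("pub1", "has", "addr1"),
    ("pat1", "has", "addr1"),
]


def derive_schema(vertices, edges):
    """Collapse instance edges into (source label, relationship, target label) triples."""
    return {(vertices[src], rel, vertices[dst]) for src, rel, dst in edges}


schema = derive_schema(vertices, edges)
print(sorted(schema))
```

Because the schema is computed from the instance graph, adding vertices with a new label extends the data model without a separate schema migration step.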

Data model embedded in knowledge graph

[Figure: two panels. Knowledge graph: graph of data instances and relationships. Data model: graph data model (schema).]

[Figure labels: Data source 1 ... Data source n (Data sources); Knowledge Graph; Visual analytics. Steps: Processing, Populating, Organising, Labelling]

Visual analytics is performed on the network using its schema

Labelling
• Labelling vertices
  • Semi-structured data:
    • Metadata → structural information
    • Tags → labels
  • Structured data:
    • Relational databases: tables, fields → labels
  • Text processing to create new labels, new vertices
• Labelling relationships
  • Semi-structured data:
    • Relationships from nested tags (Has, isPartOf, etc.)
  • Structured data:
    • Relational databases: no labels in E-R models
    • Vertex labels + text information to label relationships
      • Ex: IsA, send, receive, live, own, etc.
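For semi-structured data, the labelling rules above (tags become labels, nesting becomes a relationship) can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a nested Python dict, every nested record becomes a vertex, and every nesting step is labelled "has". The `record_to_graph` helper and the sample record are hypothetical.

```python
def record_to_graph(root_tag, record):
    """Turn a nested dict into labelled vertices and 'has' relationships.

    Tags (dict keys) become vertex labels; nesting becomes a 'has' edge.
    Scalar fields are kept as properties of the enclosing vertex.
    """
    vertices, edges = [], []

    def visit(tag, node):
        vertex_id = len(vertices)
        props = {k: v for k, v in node.items() if not isinstance(v, (dict, list))}
        vertices.append({"id": vertex_id, "label": tag, "props": props})
        for key, value in node.items():
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, dict):
                    edges.append((vertex_id, "has", visit(key, child)))
        return vertex_id

    visit(root_tag, record)
    return vertices, edges


# Hypothetical metadata record for one published item.
record = {
    "title": "Example title",
    "Kw": [{"text": "graph"}, {"text": "patent"}],
    "OrgAdd": {"text": "Example Org, Example City, Example Country"},
}
vertices, edges = record_to_graph("PubItem", record)
print([v["label"] for v in vertices])
print(edges)
```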

Ex: Publications/Patents Metadata
• Published Items
  • Publications: (Scopus, WoK, etc.)
    • Organisation address (in data)
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents: (PatStat, etc.)
    • Organisation Address (in data)
    • Category (in data)
      • Patent class

Data Model from Metadata tags

[Figure: two panels, "Document metadata" and "Data Model: Graph of labels". Nodes: PubItem:Pub, PubItem:Pat, Org:Address, KW, Cat:Scat, Cat:PatClas. Five edges, each labelled "has".]

Kw: Keyword, PubItem: Published Item, OrgAdd: Organisation Address

Publications/Patents Metadata
• Published Items
  • Publications:
    • Organisation address → text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents:
    • Organisation Address → text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Category (in data)
      • Patent class
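The address step could look roughly like the sketch below. It assumes a comma-separated layout of the form "Organisation, ..., City, Country", which real address strings often violate, and it assumes that the `isLocated` relationship links an organisation to its city and country. The helper names and the sample address are hypothetical, and the slides do not describe the actual parsing method.

```python
def parse_org_address(address):
    """Split 'Organisation, ..., City, Country' on commas (assumed layout)."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    if len(parts) < 3:
        return None  # too few fields to separate organisation, city and country
    return {"Org": parts[0], "Cty": parts[-2], "Cny": parts[-1]}


def address_to_edges(address):
    """Derive new vertices and 'isLocated' relationships from one address string."""
    parsed = parse_org_address(address)
    if parsed is None:
        return []
    return [
        (parsed["Org"], "isLocated", parsed["Cty"]),
        (parsed["Org"], "isLocated", parsed["Cny"]),
    ]


# Hypothetical address string.
print(address_to_edges("Example Institute, Dept. of Physics, Graz, Austria"))
```

Matching the extracted organisation name against other data sources (Company, Institute) would be a separate lookup step that is not shown here.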

Exploiting text information

[Figure: two panels, "Document metadata" and "Data Model: Graph of labels". Nodes: PubItem:Pat, PubItem:Pub:WoK, Cat:PatClas, Cat:Scat, KW, Org Address, Org:Comp, Org:Inst, Cty, Cny. Edge labels: has (5), isa (2), isLocated (2).]

Kw: Keyword, Org: Organisation, Inst: Institute, Comp: Company, Cny: Country, Cty: City, OrgAdd: Organisation Address

Data sources:
- Patent & publications metadata
- Patent full text (USPTO)

Preliminary work

Publications/Patents analytics

• Use case: Who are the key organisations active in a particular technology?
• Motivations
  • Technology monitoring
    • How an emerging technology is evolving (Research → Industry)
  • Foresight studies
  • Looking for partners, joining collaborations
    • Company, institution landscape

Publications/Patents analytics (2)
• Use case: What is the organisation landscape of a technology?
• Technique:
  • Search for pub/pat matching “technology terms”
    • Titles and/or abstracts
• Issues
  • Quality of the “technology terms” to identify a technology
  • Search terms may not correspond to a single technology
  • Some pertinent publications and patents may not contain the “technology terms”
• Use text processing to address these issues

Add search output to knowledge graph

Search output: A subset of publications and patents matching “technology terms”

[Figure: graph of labels with a Search node. Nodes: Search, PubItem:Pat, PubItem:Pub:WoK, Cat:PatClas, Cat:Scat, KW, Org Address, Org:Comp, Org:Inst, Cty, Cny. Edge labels: has (5), isa (2), isLocated (2).]

Illustrating the approach: Through Silicon Via
• Search terms on titles of pub/pat:
  • “Through Silicon Via”
    • Exact matching → high quality output
  • “TSV”
    • More output, but of lower quality
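The difference between the two search terms can be reproduced on a toy scale. The sketch below uses a case-insensitive whole-word regular expression as a stand-in for the actual search (the slides mention a Lucene index for the patent corpus). The titles are invented for illustration and `title_matches` is a hypothetical helper.

```python
import re


def title_matches(title, term):
    """Case-insensitive whole-word match of a search term in a title."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


# Hypothetical titles, not taken from the dataset.
titles = [
    "Through silicon via with improved reliability",
    "TSV interposer for 3D integration",
    "Detection of Taura syndrome virus (TSV) in shrimp",
]

exact = [t for t in titles if title_matches(t, "Through Silicon Via")]
acronym = [t for t in titles if title_matches(t, "TSV")]

print(exact)    # only the title spelling out the technology
print(acronym)  # both TSV titles, including the unrelated virus
```

The spelled-out phrase misses the title that only uses the acronym, while the acronym also returns a title from an unrelated field.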

Technology: Through Silicon Via (TSV) (source: Wikipedia)

Keywords: Through Silicon Via (title)

Keywords: TSV (title)

TSV also means: Taura Syndrome Virus

Methodology

Preliminary work

Approach
• Index patents
• Patent-specific preprocessing
• Create document embedding / feature vectors (Doc2Vec)
• Dimensionality reduction / manifold learning (t-SNE, LargeVis)
• Visualization (Datashader: large-scale visualization)

Index Patents / Preprocessing
• Current dataset: USPTO patents from 2006 - 2014
• Candidate generation: patents are indexed for fuzzy search (Lucene)
• Preprocessing
  • Clean HTML syntax
  • Remove references, stopwords
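The preprocessing bullets could be sketched as below, using only the standard library. This is a hedged sketch: the stopword list is a tiny placeholder, "references" are assumed to be bracketed numbers such as `[3]`, and the `preprocess` function is hypothetical. The slides do not specify the actual rules.

```python
import html
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def preprocess(raw):
    """Strip HTML, drop bracketed references and stopwords, return lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)      # remove HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"\[\d+\]", " ", text)     # remove references like [3] (assumed format)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "<p>A through-silicon via (TSV) is described in [3] &amp; the claims.</p>"
print(preprocess(sample))
# ['through', 'silicon', 'via', 'tsv', 'described', 'claims']
```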

Document Embedding
• Numeric representation of text documents
• Gensim implementation, based on word2vec-CBOW

Concept: Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents.
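To make "numeric representation of text documents" concrete without training a model, the sketch below uses a hashed bag-of-words vector. This is explicitly a different and much simpler technique than Doc2Vec: it captures no word order or learned semantics and only shows how documents become fixed-length vectors that can be compared. The `embed` and `cosine` helpers and the dimension of 16 are assumptions for illustration (the slides report 300 dimensions for the real embeddings).

```python
import math
import zlib


def embed(tokens, dim=16):
    """Hashed bag-of-words: count tokens into `dim` buckets, then L2-normalise."""
    vec = [0.0] * dim
    for tok in tokens:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


doc_a = embed(["through", "silicon", "via", "interposer"])
doc_b = embed(["through", "silicon", "via", "stack"])
doc_c = embed(["taura", "syndrome", "virus", "shrimp"])

print(round(cosine(doc_a, doc_b), 3), round(cosine(doc_a, doc_c), 3))
```

With a trained Doc2Vec model the vectors would be learned rather than hashed, but the downstream steps (similarity, dimensionality reduction, plotting) consume them in the same way.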

Dimensionality Reduction
• Our dataset contains 2 million points with 300 dimensions
• Many techniques are costly at this scale (MDS, t-SNE)
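A back-of-the-envelope calculation suggests why methods that work on all pairwise distances become costly at this scale. The sketch assumes 4-byte floats and a dense N x N matrix, as needed by classical MDS and exact t-SNE. Approximate variants avoid building this matrix, and the `dense_matrix_bytes` helper is hypothetical.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=4):
    """Memory needed for a dense rows x cols matrix (float32 assumed)."""
    return rows * cols * bytes_per_value


n_points = 2_000_000  # from the slide
n_dims = 300          # from the slide

embedding_bytes = dense_matrix_bytes(n_points, n_dims)
pairwise_bytes = dense_matrix_bytes(n_points, n_points)

print(f"embeddings: {embedding_bytes / 1e9:.1f} GB")          # embeddings: 2.4 GB
print(f"pairwise distances: {pairwise_bytes / 1e12:.1f} TB")  # pairwise distances: 16.0 TB
```

The embeddings themselves fit in memory on a single machine, while a full pairwise matrix does not.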

Large Scale Visualization
• Datashader
  • Rasterization pipeline
  • Handles large amounts of data
• Patent Class Labels
  • A: Human necessities
  • B: Performing operations; transporting
  • C: Chemistry; Metallurgy
  • D: Textiles; Paper
  • E: Fixed Constructions
  • F: Mechanical engineering; lighting;
  • G: Physics
  • H: Electricity

Colours correspond to distinct labels (2,022,349 Patents)
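The idea behind a rasterization pipeline is to aggregate points into a fixed pixel grid before drawing, so the rendering cost depends on the image size rather than on the number of points. The sketch below shows that aggregation step in plain Python. It is not Datashader's API, and the `rasterize` and `dominant_label` helpers and the sample points are hypothetical.

```python
from collections import Counter


def rasterize(points, width, height, x_range, y_range):
    """Aggregate (x, y, label) points into a height x width grid of per-label counts."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[Counter() for _ in range(width)] for _ in range(height)]
    for x, y, label in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the viewport
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col][label] += 1
    return grid


def dominant_label(cell):
    """Label with the highest count in a cell, or None for an empty cell."""
    return cell.most_common(1)[0][0] if cell else None


# Hypothetical 2-D points labelled with patent sections.
points = [(0.10, 0.10, "G"), (0.15, 0.12, "G"), (0.12, 0.10, "H"), (0.90, 0.90, "H")]
grid = rasterize(points, width=2, height=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
print([[dominant_label(cell) for cell in row] for row in grid])
# [['G', None], [None, 'H']]
```

Each cell could then be coloured by its dominant label, which corresponds to the "colours correspond to distinct labels" view.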

Full text processing
• Patent full text: Title + Abstract + Description
  • USPTO full text patent 2006 – 2015
• Objective
  • 1: Look for pertinent patents that do not contain “Through Silicon Via”
  • 2: Look for different meanings of “TSV”

International Patent Classification (IPC)

• Sections from A (“Human Necessities”) to H (“Electricity”)
• Classes (A01 “Agriculture; forestry; animal husbandry; trapping; fishing”)
• Subclasses, Group Number, Subgroup
  • Example: H01L 23/00 (Details of semiconductor or other solid state devices)
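The hierarchy can be read directly off a full IPC symbol. The sketch below splits a symbol such as `H01L 23/00` into its levels. The regular expression is an assumption about the common "subclass, space, main group/subgroup" notation and does not cover every formatting variant, and `parse_ipc` is a hypothetical helper.

```python
import re

# Section letter A-H, two-digit class, subclass letter, main group "/" subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(code):
    """Split a full IPC symbol into section, class, subclass, main group and subgroup."""
    match = IPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    return {
        "section": section,
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": group,
        "subgroup": subgroup,
    }


print(parse_ipc("H01L 23/00"))
```

Grouping patents by the `section` or `subclass` field gives the coarse and fine class labels used in the overview plots.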

Patent Class G (Physics) Overview

378,692 Patents

Patent Search (TSV or Through Silicon Via)

[Figure panels: Most prominent patent classes (1696 results); Document Embedding]

Comparison of Through Silicon Via vs TSV

• The relevant documents differ between the two searches

[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]

Through Silicon Via vs TSV

• Documents relevant to each of the two search terms

[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]

Derive new relationships

[Figure: graph of labels with a Search node. Nodes: Search, PubItem:Pat, PubItem:Pub:WoK, Cat:PatClas, Cat:Scat, KW, Org Address, Org:Comp, Org:Inst, Cty, Cny. Edge labels: has (5), isa (2), isLocated (2).]

Search extended to publications

• Publications: Titles + Abstracts
• Publications offer a different viewpoint
• Publications are classified according to the n closest patent classes

• G06F: Electric Digital Data Processing (Through Silicon Via (Antenna))

• A61K: Medical or Veterinary Science (Taura Syndrome Virus, tachycardia beat, Tellerspülvermögen [German: plate-washing capacity])
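One way to read "classified according to the n closest patent classes" is to compare a publication's embedding with the centroid of each patent class and keep the n most similar classes. The sketch below implements that reading. It is an assumption, since the slides do not state how closeness is measured, and the two-dimensional vectors, class assignments and helper names are toy values for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_centroids(patent_vectors, patent_classes):
    """Average the embedding vectors of all patents in each class."""
    sums, counts = {}, defaultdict(int)
    for vec, cls in zip(patent_vectors, patent_classes):
        sums[cls] = [s + v for s, v in zip(sums[cls], vec)] if cls in sums else list(vec)
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in total] for cls, total in sums.items()}


def n_closest_classes(pub_vector, centroids, n=2):
    """Rank classes by similarity of their centroid to the publication vector."""
    ranked = sorted(centroids, key=lambda c: cosine(pub_vector, centroids[c]), reverse=True)
    return ranked[:n]


# Toy data: four patent embeddings in two classes, one publication embedding.
patent_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
patent_classes = ["H01L", "H01L", "A61K", "A61K"]
centroids = class_centroids(patent_vectors, patent_classes)

print(n_closest_classes([0.8, 0.2], centroids, n=1))  # ['H01L']
```

A nearest-neighbour vote over individual patents would be an equally plausible reading of the same bullet.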

Conclusion
• Networks populated with metadata need to be enriched to properly support Visual Analytics
• Enrichment can come from additional data sources or from text processing on the documents referenced in the metadata
• Preliminary text processing results on patents (title, abstract and description) indicate that it is possible to:
  • Enrich a patent set w.r.t. a search result
  • Regroup patents via patent categories, showing the different meanings of search terms
  • Link some of the publications with nearby patents

Thank you for your attention!
