Enriching knowledge graphs with text processing techniques

J.-M. Le Goff, CERN

A. Rattinger CERN & Graz University of Technology

I-Know, Graz, Austria 12 October 2017

ERCIM News 111: https://ercim-news.ercim.eu/en111/r-i/collaboration-spotting-a-visual-analytics-platform-to-assist-knowledge-discovery

Agenda

• Our approach:
  • Building knowledge graph and graph data model
  • Identifying concepts and relationships in full text
  • Enriching knowledge graph
• Text mining for graph extensions (preliminary)
• Use case: patents and publications analytics

[Figure: heterogeneous data sources, text, data model]

Why build a knowledge graph out of datasets?

• Datasets contain valuable information for visual analytics
  • Businesses, applications, domains, etc.
• Datasets are difficult to use directly for visual analytics
  • They contain complex structures with various data types
  • They come with different data representations
    • Structured, semi-structured and unstructured
  • Only subsets are of interest for a particular analysis
  • Need domain-specific information to understand analytics output
• Build data network with elements of interest in datasets
  • Vertices: data instances with labelled data types
  • Relationships: interconnectivity

Data network stored as a graph

• Complexity
• Interconnectivity
• Scalability
• Multi-dimensionality

Graphs are natural representations of large and interconnected networks

• Node and relationship labels
• Compact graph structure
• Graph query language
• No need for schema evolution

Data model is embedded in the graph itself

• Data model: labels and relationships
  • Labels → graph dimensions
  • Relationships → interconnectivity between labels

Graphs of connected elements constitute multi-dimensional networks

Data Network = Knowledge graph

Data model embedded in knowledge graph

[Figure, two panels. Knowledge graph: graph of data instances and relationships. Data model: graph data model (schema).]
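The idea that the data model lives inside the graph can be sketched as follows. This is a minimal illustration, assuming a toy in-memory graph; the vertex identifiers, the sample data and the `derive_schema` helper are hypothetical and not taken from the talk. Collapsing instance-level edges onto their labels yields the label-level graph, i.e. the schema.

```python
from collections import namedtuple

# Hypothetical toy knowledge graph: every vertex carries a label,
# every edge carries a relationship type.
Edge = namedtuple("Edge", ["source", "relationship", "target"])

vertex_labels = {
    "pub-1": "PubItem:Pub",
    "pat-1": "PubItem:Pat",
    "kw-graphs": "KW",
    "addr-1": "OrgAdd",
}

edges = [
    Edge("pub-1", "has", "kw-graphs"),
    Edge("pub-1", "has", "addr-1"),
    Edge("pat-1", "has", "addr-1"),
]


def derive_schema(vertex_labels, edges):
    """Collapse instance-level edges into label-level edges (the data model)."""
    schema = set()
    for edge in edges:
        schema.add(
            (vertex_labels[edge.source], edge.relationship, vertex_labels[edge.target])
        )
    return schema


schema = derive_schema(vertex_labels, edges)
for source_label, relationship, target_label in sorted(schema):
    print(f"({source_label}) -[{relationship}]-> ({target_label})")
```

Because the schema is computed from the instances, adding a vertex with a new label extends the data model without a separate schema migration step.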

[Figure: Data sources (1 to n) → Knowledge graph → Visual analytics. Steps shown: processing, populating, organising, labelling.]

Visual analytics is performed on the network using its schema

Labelling

• Labelling vertices
  • Semi-structured data:
    • Metadata → structural information
    • Tags → labels
  • Structured data:
    • Relational databases → tables, fields → labels
  • Text processing to create new labels, new vertices
• Labelling relationships
  • Semi-structured data:
    • Relationships from nested tags (Has, isPartOf, etc.)
  • Structured data:
    • Relational databases → no labels in E-R models
  • Vertex labels + text information to label relationships
    • Ex: IsA, send, receive, live, own, etc.
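Labelling from semi-structured metadata can be sketched as below. This is an illustration only: the XML record and its tag names are invented, and `label_from_tags` is a hypothetical helper. Each tag becomes a vertex label and each level of nesting becomes a "has" relationship.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; tag names are invented for illustration.
RECORD = """
<PubItem id="p1">
  <OrgAdd>CERN, Geneva, Switzerland</OrgAdd>
  <Keyword>knowledge graph</Keyword>
  <Keyword>visual analytics</Keyword>
</PubItem>
"""


def label_from_tags(xml_text):
    """Turn tags into vertex labels and tag nesting into 'has' relationships."""
    root = ET.fromstring(xml_text)
    vertices = []  # (vertex_id, label, text value)
    relationships = []  # (parent_id, "has", child_id)

    def visit(element, parent_id):
        vertex_id = len(vertices)
        value = (element.text or "").strip()
        vertices.append((vertex_id, element.tag, value))
        if parent_id is not None:
            relationships.append((parent_id, "has", vertex_id))
        for child in element:
            visit(child, vertex_id)

    visit(root, None)
    return vertices, relationships


vertices, relationships = label_from_tags(RECORD)
for vertex_id, label, value in vertices:
    print(vertex_id, label, repr(value))
print(relationships)
```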

Ex: Publications/Patents metadata

• Published items
  • Publications: (Scopus, WoK, etc.)
    • Organisation address (in data)
    • Keyword (in data)
    • Category (in data)
      • Journal category
  • Patents: (PatStat, etc.)
    • Organisation address (in data)
    • Category (in data)
      • Patent class

Data model from metadata tags

Document metadata → data model: graph of labels

Kw: Keyword, PubItem: Published Item, OrgAdd: Organisation Address

[Figure: graph of labels with nodes PubItem:Pub, PubItem:Pat, Org: Address, KW, Cat: Scat and Cat: PatClas, connected by five "has" relationships.]

Publications/Patents metadata

• Published items
  • Publications:
    • Organisation address → text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Keyword (in data)
    • Category (in data)
      • Journal category
  • Patents:
    • Organisation address → text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Category (in data)
      • Patent class
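As a rough sketch of the text-processing step on organisation addresses: the talk does not say how addresses are parsed, so the code below simply assumes comma-separated fields with the organisation first and city and country last. The `split_org_address` helper and the sample address are hypothetical; the dictionary keys reuse the label abbreviations Org, Cty and Cny.

```python
def split_org_address(address):
    """Split 'Organisation, [street ...,] City, Country' into labelled parts.

    Assumption: fields are comma-separated, the organisation comes first
    and city and country come last. Real addresses are messier than this.
    """
    parts = [part.strip() for part in address.split(",") if part.strip()]
    if len(parts) < 3:
        raise ValueError(f"cannot split address: {address!r}")
    return {"Org": parts[0], "Cty": parts[-2], "Cny": parts[-1]}


fields = split_org_address("CERN, Route de Meyrin, Geneva, Switzerland")
print(fields)
```

Each returned field would become a new labelled vertex linked to the address vertex it was extracted from.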

Exploiting text information

Document metadata → data model: graph of labels

Kw: Keyword, Org: Organisation, Inst: Institute, Comp: Company, Cny: Country, Cty: City, OrgAdd: Organisation Address

[Figure: graph of labels with nodes PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, KW, Cat: Scat, Org Address, Org: Comp, Org: Inst, Cty and Cny, connected by "has", "isa" and "isLocated" relationships.]

Data sources:
- Patent & publications metadata
- Patent full text (USPTO)

Preliminary work

Publications/Patents analytics

• Use case: Who are the key organisations active in a particular technology?
• Motivations
  • Technology monitoring
    • How an emerging technology is evolving (Research → Industry)
  • Foresight studies
  • Looking for partners, joining collaborations
    • Company, institution landscape

Publications/Patents analytics (2)

• Use case: What is the organisation landscape of a technology?
• Technique:
  • Search for pub/pat matching “technology terms”
    • Titles and/or abstracts
• Issues
  • Quality of the “technology terms” to identify a technology
  • Search terms may not correspond to a single technology
  • Some pertinent publications and patents may not contain the “technology terms”
• Use text processing to address these issues

Add search output to knowledge graph

Search output: a subset of publications and patents matching “technology terms”

[Figure: graph of labels (PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, KW, Cat: Scat, Org Address, Org: Comp, Org: Inst, Cty, Cny; relationships "has", "isa", "isLocated") with an added Search node.]

Illustrating the approach: Through Silicon Via

• Search terms on titles of pub/pat:
  • “Through Silicon Via”
    • Exact matching → high quality output
  • “TSV”
    • More but lower quality output

Wikipedia
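The trade-off between the full phrase and the acronym can be illustrated with a small title search. The titles below are invented for the example, and `search_titles` is a hypothetical helper rather than the search engine used in the talk. The phrase matches only the semiconductor title, while the acronym also picks up an unrelated meaning.

```python
import re

# Invented titles for illustration only.
TITLES = [
    "Through silicon via with improved liner",
    "TSV structure for 3D integrated circuits",
    "Detection of TSV in farmed shrimp",
    "Silicon wafer polishing method",
]


def search_titles(titles, term):
    """Return titles containing `term` as a whole phrase, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]


phrase_hits = search_titles(TITLES, "Through Silicon Via")
acronym_hits = search_titles(TITLES, "TSV")
print(phrase_hits)
print(acronym_hits)
```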

Technology: Through Silicon Via (TSV)

Keywords: Through Silicon Via (title)

Keywords: TSV (title)

TSV also means: Taura Syndrome Virus

Methodology

Preliminary work

Approach

• Index patents
• Patent-specific preprocessing
• Create document embedding / feature vectors (Doc2Vec)
• Dimensionality reduction / manifold learning (t-SNE, LargeVis)
• Visualization (Datashader – large scale visualization)

Index Patents / Preprocessing

• Current dataset: USPTO patents from 2006 - 2014
• Candidate generation: patents are indexed for fuzzy search (Lucene)
• Preprocessing
  • Clean HTML syntax
  • Remove references, stopwords
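The preprocessing bullets can be sketched with the Python standard library. This is a simplified stand-in, not the authors' pipeline: the stopword list is a tiny illustrative sample, the helper names are hypothetical, and reference removal is left out because the talk does not describe how references are detected.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "with"}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def preprocess(markup):
    """Strip HTML, lowercase, tokenise, then drop stopwords."""
    text = strip_html(markup).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("<p>A <b>through-silicon via</b> is formed in the substrate.</p>")
print(tokens)
```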

Document Embedding

• Numeric representation of text documents
• Gensim implementation - based on word2vec-CBOW

Concept: Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents.

Dimensionality Reduction

• Our dataset contains 2 million points with 300 dimensions
• Many techniques are costly at this scale (MDS, t-SNE)
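A back-of-the-envelope calculation shows why methods that need all pairwise distances become expensive here. It assumes the 2,022,349 patents quoted on the visualization slide, 300 dimensions and 32-bit floats; the float width is an assumption, since the talk does not state the storage format.

```python
def embedding_bytes(n_points, n_dims, bytes_per_value=4):
    """Memory for a dense n_points x n_dims matrix (float32 by default)."""
    return n_points * n_dims * bytes_per_value


def pairwise_distance_count(n_points):
    """Number of distinct point pairs, n * (n - 1) / 2."""
    return n_points * (n_points - 1) // 2


N_PATENTS = 2_022_349  # figure quoted on the visualization slide
N_DIMS = 300

print(f"embeddings: {embedding_bytes(N_PATENTS, N_DIMS) / 1e9:.2f} GB")
print(f"point pairs: {pairwise_distance_count(N_PATENTS):.3e}")
```

The embedding matrix itself is a few gigabytes, but the number of point pairs is on the order of 10^12, which is what makes classical MDS and exact t-SNE impractical without approximation.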

Large Scale Visualization

• Datashader
  • Rasterization pipeline
  • Handles large amounts of data
• Patent class labels
  • A: Human necessities
  • B: Performing operations; transporting
  • C: Chemistry; Metallurgy
  • D: Textiles; Paper
  • E: Fixed Constructions
  • F: Mechanical engineering; lighting;
  • G: Physics
  • H: Electricity

Colours correspond to distinct labels (2,022,349 Patents)
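The core of a rasterization pipeline is an aggregation step: points are binned into a fixed pixel grid, so rendering cost depends on the image size rather than on the number of points. The sketch below shows that step in plain Python. It does not use or reproduce Datashader's API, and picking the most frequent label per pixel is a simplification chosen for the example.

```python
from collections import Counter


def rasterize(points, width, height, x_range, y_range):
    """Aggregate (x, y, label) points into a width x height grid of label counts."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[Counter() for _ in range(width)] for _ in range(height)]
    for x, y, label in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # outside the viewport
        # Map coordinates to pixel indices; clamp the upper edge into the last bin.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col][label] += 1
    return grid


def dominant_label(cell):
    """Pick the label used to colour a pixel, or None for an empty pixel."""
    return cell.most_common(1)[0][0] if cell else None


# Invented 2D points labelled with patent sections.
points = [(0.1, 0.1, "G"), (0.15, 0.12, "G"), (0.12, 0.11, "H"), (0.9, 0.9, "A")]
grid = rasterize(points, width=2, height=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
print([[dominant_label(cell) for cell in row] for row in grid])
```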

Full text processing

• Patent full text: Title + Abstract + Description
  • USPTO full text patent 2006 – 2015
• Objective
  • 1: Look for pertinent patents that do not contain “Through Silicon Via”
  • 2: Look for different meanings of “TSV”

International Patent Classification (IPC)

• Sections from A (“Human Necessities”) to H (“Electricity”)
• Classes (A01 "Agriculture; forestry; animal husbandry; trapping; fishing")
• Subclasses, Group Number, Subgroup
  • Example: H01L 23/00 (Details of semiconductor or other solid state devices)
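The hierarchy can be made concrete by splitting an IPC symbol into its levels. This is a minimal sketch: the `IpcSymbol` type and `parse_ipc` helper are hypothetical, and the regular expression covers only full symbols written like the example above.

```python
import re
from typing import NamedTuple


class IpcSymbol(NamedTuple):
    section: str     # e.g. "H"
    ipc_class: str   # e.g. "H01"
    subclass: str    # e.g. "H01L"
    main_group: str  # e.g. "23"
    subgroup: str    # e.g. "00"


_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol):
    """Split an IPC symbol such as 'H01L 23/00' into its hierarchy levels."""
    match = _IPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a full IPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    ipc_class = section + class_digits
    return IpcSymbol(section, ipc_class, ipc_class + subclass_letter, main_group, subgroup)


parsed = parse_ipc("H01L 23/00")
print(parsed)
```

Grouping patents by `section`, `ipc_class` or `subclass` then gives coarser or finer category labels for the same document.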

Patent Class G (Physics) Overview

378,692 Patents

Patent Search (TSV or Through Silicon Via)

[Figure panels: most prominent patent classes (1696 results); document embedding]

Comparison of Through Silicon Via vs TSV

• The relevant results differ between the two search terms

[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]

Through Silicon Via vs TSV

• Documents relevant to the search for the different terms

[Figure panels: “Through Silicon Via” (954 results); “TSV” (982 results)]

Derive new relationships

[Figure: graph of labels (PubItem:Pat, PubItem:Pub: WoK, Cat: PatClas, KW, Cat: Scat, Org Address, Org: Comp, Org: Inst, Cty, Cny; relationships "has", "isa", "isLocated") with the Search node.]

Search extended to publications

• Publications: Titles + Abstracts
• Publications offer a different viewpoint
• Publications are classified according to the n-closest patent classes
  • G06F: Electric Digital Data Processing (Through Silicon Via (Antenna))
  • A61K: Medical or Veterinary Science (Taura-Syndrom-Virus [Taura syndrome virus], tachycardia beat, Tellerspülvermögens [German: plate-washing capacity])
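One possible reading of "n-closest patent classes" is a nearest-neighbour vote in the embedding space: embed the publication, find the n most similar patents, and rank the classes of those patents. The sketch below follows that reading. It is an assumption about the method, and the three-dimensional vectors and class assignments are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def n_closest_classes(query_vector, patent_vectors, n=3):
    """Rank patent classes by how often they occur among the n nearest patents."""
    ranked = sorted(
        patent_vectors,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    votes = Counter(patent_class for _, patent_class in ranked[:n])
    return [patent_class for patent_class, _ in votes.most_common()]


# Invented 3-dimensional embeddings; the talk uses 300-dimensional vectors.
patents = [
    ([0.9, 0.1, 0.0], "H01L"),
    ([0.8, 0.2, 0.1], "H01L"),
    ([0.7, 0.1, 0.3], "G06F"),
    ([0.0, 0.1, 0.9], "A61K"),
]
publication = [0.85, 0.15, 0.05]
classes = n_closest_classes(publication, patents, n=3)
print(classes)
```

A publication whose nearest patents fall in an unexpected class, such as A61K for a "TSV" query, signals a different meaning of the search term.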

Conclusion

• Networks populated with metadata need to be enriched to properly support visual analytics
• Enrichment can come from additional data sources or from text processing on the documents referenced in the metadata
• Preliminary text processing results on patents (title, abstract and description) indicate that it is possible to:
  • Enrich a patent set w.r.t. a search result
  • Regroup patents via patent categories showing different meanings of search terms
  • Link some of the publications with nearby patents

Thank you for your attention!