
Enriching knowledge graphs with text processing techniques

J.-M. Le Goff, CERN

A. Rattinger, CERN & Graz University of Technology

I-Know, Graz, Austria, 12 October 2017

ERCIM News 111: https://ercim-news.ercim.eu/en111/r-i/collaboration-spotting-a-visual-analytics-platform-to-assist-knowledge-discovery

Agenda
• Our approach:
  • Building knowledge graph and graph data model
  • Identifying concepts and relationships in full text
  • Enriching knowledge graph
• Text mining for graph extensions (preliminary)
• Use case: patents and publications analytics


Heterogeneous data sources Text Data model

Why build a knowledge graph out of datasets?

• Datasets contain valuable information for visual analytics
  • Businesses, applications, domains, etc.
• Datasets are difficult to use directly for visual analytics
  • They contain complex structures with various data types
  • They come with different data representations
    • Structured, semi-structured and unstructured
  • Only subsets are of interest for a particular analysis
  • Domain-specific information is needed to understand analytics output
• Build a data network with the elements of interest in the datasets
  • Vertices: data instances with labelled data types
  • Relationships: interconnectivity


Data network stored as a graph
• Complexity
• Interconnectivity
• Scalability
• Multi-dimensionality

Graphs are natural representations of large and interconnected networks
• Node and relationship labels
• Compact graph structure
• Graph query language
• No need for schema evolution

Data model is embedded in the graph itself
• Data model: labels and relationships
• Labels → Graph dimensions
• Relationships → Interconnectivity between labels

Graphs of connected elements constitute multi-dimensional networks

Data Network = Knowledge graph

Knowledge graph: graph of data instances and relationships

Data model: graph data model (schema)

Data model embedded in knowledge graph
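One way to picture a data model embedded in the graph is to collapse the instance graph into a graph of labels. The sketch below is only an illustration: the vertex ids, the sample edges and the `derive_schema` helper are assumptions and not part of the platform described in these slides.

```python
from collections import namedtuple

# A vertex is a data instance carrying a label (its data type).
Vertex = namedtuple("Vertex", ["id", "label"])
# An edge connects two vertex ids through a labelled relationship.
Edge = namedtuple("Edge", ["source", "relationship", "target"])

# Hypothetical instance graph: two published items and their metadata.
vertices = [
    Vertex("pub1", "PubItem:Pub"),
    Vertex("pat1", "PubItem:Pat"),
    Vertex("kw1", "KW"),
    Vertex("org1", "OrgAdd"),
]
edges = [
    Edge("pub1", "has", "kw1"),
    Edge("pub1", "has", "org1"),
    Edge("pat1", "has", "org1"),
]


def derive_schema(vertices, edges):
    """Collapse the instance graph into a graph of labels (the data model)."""
    label_of = {v.id: v.label for v in vertices}
    return {
        (label_of[e.source], e.relationship, label_of[e.target])
        for e in edges
    }


schema = derive_schema(vertices, edges)
for triple in sorted(schema):
    print(triple)
```

Because the schema is computed from the instances, adding a vertex with a new label extends the data model without a separate schema migration, which is one reading of "no need for schema evolution".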

[Figure: data sources feed the knowledge graph, which supports visual analytics; steps shown: processing, populating, organising, labelling]

Visual analytics is performed on the network using its schema

Labelling
• Labelling vertices
  • Semi-structured data:
    • Metadata → Structural information
    • Tags → labels
  • Structured data:
    • Relational databases: tables, fields → labels
  • Text processing to create new labels, new vertices
• Labelling relationships
  • Semi-structured data:
    • Relationships from nested tags (Has, isPartOf, etc.)
  • Structured data:
    • Relational databases: no labels in E-R models
  • Vertex labels + text information to label relationships
    • Ex: IsA, send, receive, live, own, etc.
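For semi-structured data, the labelling rules above (tags become labels, nested tags become relationships) could look roughly like the following. The XML record, its tag names and the `label_record` helper are hypothetical; real metadata exports will differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; tag names are illustrative only.
RECORD = """
<PubItem>
  <Keyword>Through Silicon Via</Keyword>
  <OrgAdd>Example Institute, Graz, Austria</OrgAdd>
  <Category>Physics</Category>
</PubItem>
"""


def label_record(xml_text):
    """Return (vertices, relationships) derived from nested tags."""
    root = ET.fromstring(xml_text)
    vertices = []       # (label, text value)
    relationships = []  # (parent label, "has", child label)

    def walk(element):
        value = (element.text or "").strip()
        vertices.append((element.tag, value))
        for child in element:
            # Nesting is read as a "has" relationship between labels.
            relationships.append((element.tag, "has", child.tag))
            walk(child)

    walk(root)
    return vertices, relationships


vertices, relationships = label_record(RECORD)
print(relationships)
```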


Ex: Publications/Patents Metadata
• Published Items
  • Publications: (Scopus, WoK, etc.)
    • Organisation address (in data)
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents: (PatStat, etc.)
    • Organisation Address (in data)
    • Category (in data)
      • Patent class


Data Model from Metadata tags

[Figure: document metadata and the resulting data model (graph of labels), with nodes PubItem:Pub, PubItem:Pat, Org: Addre, KW, Cat: Scat and Cat: PatClas, linked by "has" relationships]

Kw: Keyword, PubItem: Published Item, OrgAdd: Organisation Address

Publications/Patents Metadata
• Published Items
  • Publications:
    • Organisation address → Text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Keyword (in data)
    • Category (in data)
      • Journal Category
  • Patents:
    • Organisation Address → Text processing
      • Organisation (from other data sources: Company, Institute)
      • City
      • Country
    • Category (in data)
      • Patent class
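As a rough illustration of text processing on an organisation address, the sketch below splits an address string into organisation, city and country vertices. It assumes a comma-separated "Organisation, ..., City, Country" layout, which real addresses often violate; the function names and the choice of which vertices the `isLocated` relationships connect are assumptions.

```python
def split_address(address):
    """Split 'Organisation, ..., City, Country' on commas (naive heuristic)."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    if len(parts) < 3:
        raise ValueError(f"cannot split address: {address!r}")
    return {"Org": parts[0], "Cty": parts[-2], "Cny": parts[-1]}


def address_to_edges(pub_item, address):
    """Turn one address into labelled vertices linked by relationships."""
    fields = split_address(address)
    org = ("Org", fields["Org"])
    cty = ("Cty", fields["Cty"])
    cny = ("Cny", fields["Cny"])
    return [
        (pub_item, "has", org),
        (org, "isLocated", cty),   # assumed direction: organisation -> city
        (cty, "isLocated", cny),   # assumed direction: city -> country
    ]


edges = address_to_edges(("PubItem:Pub", "pub1"), "Example Institute, Graz, Austria")
for edge in edges:
    print(edge)
```

Resolving the organisation to a company or an institute would need another data source, as the slide notes; this sketch does not attempt it.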


Exploiting text information

[Figure: document metadata and the resulting data model (graph of labels), with nodes PubItem:Pat, Cat: PatClas, PubItem:Pub: WoK, KW, Cat: Scat, Cty, Cny, Org: Comp, Org Address and Org: Inst, linked by "has" and "isLocated" relationships]

Kw: Keyword, Org: Organisation, Inst: Institute, Comp: Company, Cny: Country, Cty: City, OrgAdd: Organisation Address

Data sources:
- Patent & publications metadata
- Patent full text (USPTO)

Preliminary work

Publications/Patents analytics
• Use case: Who are the key organisations active in a particular technology?
• Motivations
  • Technology monitoring
    • How an emerging technology is evolving (Research → Industry)
  • Foresight studies
  • Looking for partners, joining collaborations
  • Company, institution landscape


Publications/Patents analytics (2)
• Use case: What is the organisation landscape of a technology?
• Technique:
  • Search for pub/pat matching “technology terms”
    • Titles and/or abstracts
• Issues
  • Quality of the “technology terms” to identify a technology
  • Search terms may not correspond to a single technology
  • Some pertinent publications and patents may not contain the “technology terms”
• Use text processing to address these issues


Add search output to knowledge graph

Search output: A subset of publications and patents matching “technology terms”

[Figure: graph of labels with a Search node added; nodes Search, PubItem:Pat, Cat: PatClas, PubItem:Pub: WoK, KW, Cat: Scat, Cty, Cny, Org: Comp, Org Address and Org: Inst, linked by "has" and "isLocated" relationships]

Illustrating the approach: Through Silicon Via
• Search terms on titles of pub/pat:
  • “Through Silicon Via”
    • Exact matching → High quality output
  • “TSV”
    • More output, but of lower quality
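The trade-off between the two search terms can be reproduced on a handful of made-up titles. The titles and patterns below are assumptions for illustration only; in particular, the phrase pattern also tolerates hyphens and plurals, which is looser than strict exact matching.

```python
import re

# Hypothetical titles; the third one uses "TSV" in an unrelated sense.
TITLES = [
    "Through silicon via fabrication for 3D integration",
    "Thermal stress in TSV interconnects",
    "Detection of TSV in farmed shrimp",
    "Copper filling of through-silicon vias",
]

# Full phrase, case-insensitive, allowing spaces or hyphens between words.
PHRASE = re.compile(r"\bthrough[\s-]+silicon[\s-]+vias?\b", re.IGNORECASE)
# Acronym, kept case-sensitive so that ordinary words are not matched.
ACRONYM = re.compile(r"\bTSVs?\b")


def search(pattern, titles):
    """Return the titles in which the pattern occurs."""
    return [t for t in titles if pattern.search(t)]


phrase_hits = search(PHRASE, TITLES)
acronym_hits = search(ACRONYM, TITLES)
print(phrase_hits)
print(acronym_hits)
```

The phrase search returns only titles about the technology, while the acronym search also returns the shrimp title, which is the lower-quality output the slide refers to.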


Wikipedia

Technology: Through Silicon Via (TSV)


Keywords: Through Silicon Via (title)


Keywords: TSV (title)

TSV also means: Taura Syndrome Virus

Methodology

Preliminary work

Approach
• Index patents
• Patent-specific preprocessing
• Create document embedding / feature vectors (Doc2Vec)
• Dimensionality reduction / manifold learning (t-SNE, LargeVis)
• Visualization (Datashader: large-scale visualization)


Index Patents / Preprocessing
• Current dataset: USPTO patents from 2006-2014
• Candidate generation: patents are indexed for fuzzy search (Lucene)
• Preprocessing
  • Clean HTML syntax
  • Remove references, stopwords
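A minimal sketch of the preprocessing step, assuming regex-based tag stripping and a tiny hand-written stopword list. Both are simplifications chosen for illustration, and the removal of references is not covered because the slides do not say how references are identified.

```python
import re
from html import unescape

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

TAG = re.compile(r"<[^>]+>")       # crude match for HTML tags
TOKEN = re.compile(r"[a-z0-9]+")   # lowercase alphanumeric tokens


def preprocess(raw_html):
    """Strip HTML markup, lowercase, tokenize and drop stopwords."""
    text = unescape(TAG.sub(" ", raw_html))
    tokens = TOKEN.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "<p>The via is etched in the <b>silicon</b> wafer &amp; filled.</p>"
print(preprocess(sample))
```

The resulting token lists are the kind of input a document embedding model would be trained on.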


Document Embedding
• Numeric representation of text documents
• Gensim implementation, based on word2vec-CBOW

Concept: Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents.

Dimensionality Reduction
• Our dataset contains 2 million points with 300 dimensions
• Many techniques are costly at this scale (MDS, t-SNE)


Large Scale Visualization
• Datashader
  • Rasterization pipeline
  • Handles large amounts of data
• Patent class labels
  • A: Human necessities
  • B: Performing operations; transporting
  • C: Chemistry; Metallurgy
  • D: Textiles; Paper
  • E: Fixed Constructions
  • F: Mechanical engineering; lighting;
  • G: Physics
  • H: Electricity


Colours correspond to distinct labels (2,022,349 Patents)

Full text processing
• Patent full text: Title + Abstract + Description
  • USPTO full text patent 2006-2015
• Objective
  • 1: Look for pertinent patents that do not contain “Through Silicon Via”
  • 2: Look for different meanings of “TSV”


International Patent Classification (IPC)
• Sections from A (“Human Necessities”) to H (“Electricity”)
• Classes (A01 "Agriculture; forestry; animal husbandry; trapping; fishing")
• Subclasses, Group Number, Subgroup
  • Example: H01L 23/00 (Details of semiconductor or other solid state devices)
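The hierarchy in an IPC symbol can be read off its string form. The parser below is a sketch that handles symbols written like the example `H01L 23/00`; the `IPCSymbol` type, the `parse_ipc` name and the digit-length bounds in the pattern are assumptions, and other IPC notations are not covered.

```python
import re
from typing import NamedTuple


class IPCSymbol(NamedTuple):
    section: str    # e.g. "H"
    ipc_class: str  # e.g. "H01"
    subclass: str   # e.g. "H01L"
    group: str      # e.g. "23"
    subgroup: str   # e.g. "00"


# Section letter A-H, two class digits, subclass letter, group "/" subgroup.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol):
    """Split an IPC symbol such as 'H01L 23/00' into its hierarchy levels."""
    match = IPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not an IPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, group, subgroup = match.groups()
    ipc_class = section + class_digits
    return IPCSymbol(section, ipc_class, ipc_class + subclass_letter, group, subgroup)


parsed = parse_ipc("H01L 23/00")
print(parsed)
```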


Patent Class G (Physics) Overview


378,692 Patents

Patent Search (TSV or Through Silicon Via)


Most prominent patent classes (1696 results)

Document Embedding
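Ranking the patent classes of a result set is a simple count. The hits below are invented for illustration and do not reflect the 1696 results shown on the slide; `prominent_classes` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical search hits: (patent id, IPC subclass).
hits = [
    ("p1", "H01L"),
    ("p2", "H01L"),
    ("p3", "G06F"),
    ("p4", "A61K"),
    ("p5", "H01L"),
    ("p6", "G06F"),
]


def prominent_classes(hits, top=3):
    """Rank IPC subclasses by how many search hits fall into each."""
    counts = Counter(subclass for _, subclass in hits)
    return counts.most_common(top)


ranking = prominent_classes(hits)
print(ranking)
```

A class that is far from the expected technology, such as a medical subclass for a semiconductor search, hints that the search term has a second meaning.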

Comparison of Through Silicon Via vs TSV


• The relevant documents differ between the two sets of search results

“Through Silicon Via” (954 Results) “TSV” (982 Results)

Through Silicon Via vs TSV

• Documents relevant to the search, shown for each of the two terms

“Through Silicon Via” (954 Results) “TSV” (982 Results)


Derive new relationships


[Figure: graph of labels including the Search node; nodes Search, PubItem:Pat, Cat: PatClas, PubItem:Pub: WoK, KW, Cat: Scat, Cty, Cny, Org: Comp, Org Address and Org: Inst, linked by "has" and "isLocated" relationships]

Search extended to publications
• Publications: Titles + Abstracts
• Publications offer a different viewpoint
• Publications are classified according to the n-closest patent classes
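One possible reading of classifying publications by the closest patent classes is a nearest-neighbour vote in the embedding space. The sketch below assumes cosine similarity and a majority vote over the n closest patents; neither choice is confirmed by the slides, and the 2-D vectors merely stand in for 300-dimensional document embeddings.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(pub_vector, patents, n=3):
    """Assign the majority class among the n patents closest to pub_vector.

    `patents` is a list of (vector, patent_class) pairs.
    """
    ranked = sorted(patents, key=lambda p: cosine(pub_vector, p[0]), reverse=True)
    votes = Counter(patent_class for _, patent_class in ranked[:n])
    return votes.most_common(1)[0][0]


# Toy vectors and classes, invented for illustration.
patents = [
    ([1.0, 0.1], "H01L"),
    ([0.9, 0.2], "H01L"),
    ([0.8, 0.3], "G06F"),
    ([0.1, 1.0], "A61K"),
    ([0.2, 0.9], "A61K"),
]

print(classify([1.0, 0.0], patents))
print(classify([0.0, 1.0], patents))
```

This requires publications and patents to be embedded in the same vector space, for example by inferring publication vectors with the model trained on patents.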


• G06F: Electric Digital Data Processing (Through Silicon Via (Antenna))

• A61K: Medical or Veterinary Science (Taura Syndrome Virus, tachycardia beat, Tellerspülvermögen [German: plate-washing capacity])

Conclusion
• Networks populated with metadata need to be enriched to properly support Visual Analytics
• Enrichment can come from additional data sources or from text processing on the documents referenced in the metadata
• Preliminary text processing results on patents (title, abstract and description) indicate that it is possible to:
  • Enrich a patent set w.r.t. a search result
  • Regroup patents via patent categories showing different meanings of search terms
  • Link some of the publications with nearby patents


Thank you for your attention!
