
eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

Page 1: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

WWW.LEDS-PROJEKT.DE

ECCENCA CORPORATE MEMORY

SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES

ROBERT ISELE

September 26, 2016

Page 2: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

MOTIVATION

Enterprise Data Management Objective:

“Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports.”

Source: 2015 Data Management Industry Benchmark, EDM Council

In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.

Page 3: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

[Diagram: architecture overview]

• Business areas: Management Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing, Accounting

• Layers: Inbound, Data Sources, Corporate Memory, Outbound and Consumption

• Components:

  • Inbound Raw Data Store

  • Knowledge Graph for Meta Data, KPI Definition and Data Models

  • Frontend to Access Relationship and KPI Definition / Documentation

  • Frontend to Access (ad hoc) Reports

  • Outbound Data Delivery to Target Systems

  • Big Data

  • DWH-Infrastructure

Page 4: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Ingestion

• Files in the data lake (CSV, XML, Excel)

• (Relational) databases

Page 5: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Lake

• Emerging approach to handle large amounts of data

• Cost-effective storage

• Data is held in its native format

Good: Does not force an up-front integration of the ingested data sets

Bad: Retaining an overview of disparate data silos in the lake is challenging without a coherent shared view

Page 6: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Data Warehouses

• Existing infrastructure

• Typically relational databases

Page 7: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Metadata Layer

• Dataset Metadata

• Ontologies

• Integration Rules

Page 8: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

ARCHITECTURE

Graphical User Interface

Customer Applications

Page 9: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

INTEGRATION PROCESS

Dataset Management

• Catalog Datasets

• Catalog Ontologies

• Manage Metadata

Dataset Discovery

• Data Profiling

• Dataset Exploration

Dataset Integration

• Dataset Lifting

• Dataset Linking

• Data Quality Validation

Data Access

• Domain Specific Consolidated Views

• Execution on Hadoop

Page 10: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET MANAGEMENT

Page 11: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET CATALOG

• Enables the user to explore and manage datasets in the data lake

  • Files in the data lake (CSV, XML, Excel)

  • Databases (Apache Hive or external databases)

Page 12: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

MANAGING METADATA

• Exploring and editing dataset metadata

  • Semantic content information, like textual descriptions, tags and related persons

  • Technical information and parameters, like formats, data model and encoding

  • Access information, like access path or URL, source system or API call

  • Organizational provenance, like organizational units owning or maintaining the dataset
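
The four metadata groups listed above could be held in a single catalog record. The following is a minimal sketch only: the class name, the field names, the helper `find_by_tag` and all example values (including the file path) are hypothetical and are not taken from eccenca Corporate Memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical catalog record; field names are illustrative only."""
    # Semantic content information
    description: str
    tags: List[str] = field(default_factory=list)
    related_persons: List[str] = field(default_factory=list)
    # Technical information and parameters
    data_format: str = "CSV"
    encoding: str = "UTF-8"
    # Access information
    access_path: str = ""
    source_system: str = ""
    # Organizational provenance
    owning_unit: str = ""


def find_by_tag(catalog: List[DatasetMetadata], tag: str) -> List[DatasetMetadata]:
    """Return all catalog entries that carry the given tag."""
    return [entry for entry in catalog if tag in entry.tags]


catalog = [
    DatasetMetadata(
        description="Corporate bonds master data",
        tags=["bonds", "securities"],
        access_path="/lake/raw/bonds.csv",  # made-up path
        owning_unit="Treasury",
    ),
    DatasetMetadata(
        description="Internal ratings",
        tags=["ratings"],
        data_format="Excel",
        owning_unit="Risk Management",
    ),
]

bond_entries = find_by_tag(catalog, "bonds")
```

Keeping all four groups in one record lets a catalog search combine business terms (tags) with technical and organizational filters.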

Page 14: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET DISCOVERY

• Goal: Augment a dataset with data from related datasets

• Automatic discovery of datasets with overlapping information

• Explorative interface

• Discovery is based on two kinds of data

  • Business metadata

  • Profiling summary

Page 15: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DISCOVERY VIEW

• Datasets are matched based on their metadata (profiling + business data)
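
The slides do not say how the match is computed. As one possible reading, the sketch below scores two datasets by the overlap of their business tags and of their profiled column types. The Jaccard measure, the equal weighting and all dataset contents are assumptions made for illustration.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of common elements; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(left: Dict[str, Set[str]], right: Dict[str, Set[str]]) -> float:
    """Average the overlap of business tags and of profiled column types."""
    tag_overlap = jaccard(left["tags"], right["tags"])
    type_overlap = jaccard(left["column_types"], right["column_types"])
    return 0.5 * tag_overlap + 0.5 * type_overlap


# Made-up metadata for three datasets.
bonds = {"tags": {"bonds", "securities"}, "column_types": {"isin", "country", "industry"}}
ratings = {"tags": {"ratings", "securities"}, "column_types": {"isin", "rating"}}
weather = {"tags": {"weather"}, "column_types": {"date", "number"}}

# Rank candidate datasets by how well they match the bonds dataset.
ranked = sorted(
    [("ratings", match_score(bonds, ratings)), ("weather", match_score(bonds, weather))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

In this toy data the ratings dataset ranks first because it shares a tag and an ISIN column with the bonds dataset.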

Page 16: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET PROFILING

• Datasets often contain implicit and explicit schema information

  • Column names, data formats, enumerated values etc.

  • Example: column contains formatted dates

• Idea: Extract a dataset summary

• For each column / property the summary contains:

  1. Data type (e.g., number, date, industry classification)

  2. Data format (e.g., date format)

  3. Data statistics (e.g., range, distribution, most frequent values)

• Materialized as RDF with UI view
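
A per-column summary of this kind can be sketched in a few lines. The example below only distinguishes numbers from text and computes a value range and the most frequent values; it does not materialize anything as RDF. The function names and the sample table are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def is_number(value: str) -> bool:
    """True if the string can be read as a floating point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_column(values: List[str]) -> Dict[str, object]:
    """Summarize one column: detected type, value range and most frequent values."""
    numeric = bool(values) and all(is_number(v) for v in values)
    summary: Dict[str, object] = {
        "type": "number" if numeric else "text",
        "most_frequent": Counter(values).most_common(2),
    }
    if numeric:
        numbers = [float(v) for v in values]
        summary["range"] = (min(numbers), max(numbers))
    return summary


# Made-up sample table, one list of cell values per column.
table = {
    "coupon": ["5.2", "1.5", "6.5", "1.5"],
    "country": ["Canada", "Germany", "France", "Germany"],
}
summary = {column: profile_column(values) for column, values in table.items()}
```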

Page 17: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DETECTING DATA TYPES

• Detecting common datatypes as well as user-defined types

• Common datatypes

  • Numbers

  • Dates / Times

  • Geographic locations (geo-coordinates, states, countries)

• User-defined data types can be integrated by adding an ontology / taxonomy

  • Usually a SKOS taxonomy

  • Managed as another dataset in the dataset management

• Example: Industry taxonomy

  • Standard taxonomy (NACE, SIC, NAICS) or company specific
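
One simple way to detect a user-defined type is to check how many values of a column appear among the labels of a taxonomy. The sketch below assumes the taxonomy labels are already available as plain sets; the `TAXONOMIES` dictionary, the threshold of 0.8 and the function name are hypothetical, and a real SKOS taxonomy would be loaded from the dataset management instead.

```python
from typing import Dict, List, Optional, Set

# Hypothetical stand-in for the labels of taxonomies managed as datasets.
TAXONOMIES: Dict[str, Set[str]] = {
    "industry": {"banking", "utilities", "electrical equipment"},
    "country": {"canada", "germany", "france"},
}


def detect_user_type(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the taxonomy whose labels cover at least `threshold` of the values."""
    if not values:
        return None
    best_name, best_share = None, 0.0
    for name, labels in TAXONOMIES.items():
        # Share of column values that match a taxonomy label (case-insensitive).
        share = sum(v.strip().lower() in labels for v in values) / len(values)
        if share > best_share:
            best_name, best_share = name, share
    return best_name if best_share >= threshold else None


industry_type = detect_user_type(["Banking", "Electrical Equipment", "Utilities"])
unknown_type = detect_user_type(["CA639832AA25", "DE000A1G85B4"])
```

A threshold below 1.0 tolerates a few misspelled or missing labels in the column.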

Page 18: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

FORMATS AND STATISTICS

• For some types, the data format is detected

  • Example: Dates are formatted in DD-MM-YYYY

• Two functions are generated:

  1. Parser that is able to read the detected representation

  2. Normalizer that converts the parsed values into a configurable, organization-wide target representation

• Statistics summarize the values:

  • Value range and distribution

  • Most frequent values

  • Data selectivity
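
The parser / normalizer pair can be illustrated for dates with the standard library. This is a sketch under assumptions: the list of candidate formats, the function names and the ISO target format are chosen for the example, and the first candidate that parses every value wins, so ambiguous columns (such as 01-02-2019) are not resolved.

```python
from datetime import datetime
from typing import Callable, List, Optional

# Candidate formats to try; the list is an assumption for this sketch.
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%m/%d/%Y"]


def detect_date_format(values: List[str]) -> Optional[str]:
    """Return the first candidate format that parses every value."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None


def make_parser(fmt: str) -> Callable[[str], datetime]:
    """Generate a parser for the detected representation."""
    return lambda value: datetime.strptime(value, fmt)


def make_normalizer(target_fmt: str) -> Callable[[datetime], str]:
    """Generate a normalizer that writes the configured target representation."""
    return lambda parsed: parsed.strftime(target_fmt)


column = ["26-01-2019", "15-03-2020"]  # DD-MM-YYYY, as in the slide's example
detected = detect_date_format(column)
parse = make_parser(detected)
normalize = make_normalizer("%Y-%m-%d")
normalized = [normalize(parse(value)) for value in column]
```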

Page 21: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATA INTEGRATION

• The integration process is driven by a set of rules

  • Lifting Rules map the source datasets to an ontology

  • Linking Rules connect different datasets to a knowledge graph

• Rules are operator trees, consisting of four types of operators

  • Data Access Operators

  • Transformation Operators

  • Similarity Operators

  • Aggregation Operators

• Rules can be learned using genetic programming algorithms

• Rules are human understandable and can be edited
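
To make the idea of an operator tree concrete, the sketch below builds a small linking rule from the four operator types. It is one possible reading of the slide, not the product's rule format: the class names, the use of `difflib.SequenceMatcher` as similarity measure, the minimum as aggregation and the sample entities are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List

Entity = Dict[str, str]


@dataclass
class Access:
    """Data access operator: reads one property of an entity."""
    prop: str

    def value(self, entity: Entity) -> str:
        return entity[self.prop]


@dataclass
class Transform:
    """Transformation operator: applies a function to the child's value."""
    child: object
    fn: Callable[[str], str]

    def value(self, entity: Entity) -> str:
        return self.fn(self.child.value(entity))


@dataclass
class Similarity:
    """Similarity operator: compares source and target values, result in [0, 1]."""
    source: object
    target: object

    def score(self, a: Entity, b: Entity) -> float:
        return SequenceMatcher(None, self.source.value(a), self.target.value(b)).ratio()


@dataclass
class Aggregate:
    """Aggregation operator: combines child scores, here by taking the minimum."""
    children: List[object]

    def score(self, a: Entity, b: Entity) -> float:
        return min(child.score(a, b) for child in self.children)


# Rule: two entities match if both the ISIN and the name are similar after normalizing case.
rule = Aggregate([
    Similarity(Transform(Access("isin"), str.upper), Transform(Access("isin"), str.upper)),
    Similarity(Transform(Access("name"), str.lower), Transform(Access("name"), str.lower)),
])

bond = {"isin": "ca639832aa25", "name": "NEDWBK CAD 5,2%25"}
rating = {"isin": "CA639832AA25", "name": "nedwbk cad 5,2%25"}
other = {"isin": "DE000A1G85B4", "name": "SIEMENSF1.50%03/20"}

same_score = rule.score(bond, rating)
other_score = rule.score(bond, other)
```

Because the rule is a plain tree of small operators, it stays readable for a human editor and is also a natural representation for a learning algorithm to recombine.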

Page 22: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATASET LIFTING

• Objective: Map the datasets in the data lake to a consistent vocabulary.

• A lifting rule consists of a number of mappings

  • Each mapping assigns a term in the original data set (such as a column for tabular data) to a term in the target ontology (such as a property provided by an ontology).

• Multiple mappings for each dataset can be managed to allow different views on the same data.

• Initial mappings are generated automatically from the profiling results; the user can then build on them.

Page 23: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

LIFTING EXAMPLE

Bond | ISIN | Country | Industry

NEDWBK CAD 5,2%25 | CA639832AA25 | Canada | Banking

SIEMENSF1.50%03/20 | DE000A1G85B4 | Germany | Electrical Equipment

Electricite de France (EDF), 6,5% 26jan2019 | USF2893TAB29 | France | Utilities

[Diagram: the bond "NEDWBK CAD 5,2%25" lifted to a graph. Edge labels: fibo:hasSecurityIdentifier (value "CA639832AA25"), fibo:legallyRecordedIn (into the Country Ontology; nodes shown: France, Germany, EMEA), fibo:industrySector (into the Industry Ontology; nodes shown: Utilities, Banking)]
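
The lifting step for the table in this example can be sketched as a column-to-property mapping that emits triples. Only the three property names come from the slide; pairing them with these columns, using the bond name as subject and keeping the values as plain strings (instead of resolving them against the country and industry ontologies) are simplifying assumptions.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# One mapping per column: source column -> target property.
# The column-to-property pairing is an assumption of this sketch.
MAPPINGS: Dict[str, str] = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}


def lift(rows: List[Dict[str, str]], subject_column: str) -> List[Triple]:
    """Turn each mapped cell of a tabular row into a (subject, property, value) triple."""
    triples = []
    for row in rows:
        subject = row[subject_column]
        for column, prop in MAPPINGS.items():
            if column in row:
                triples.append((subject, prop, row[column]))
    return triples


rows = [
    {"Bond": "NEDWBK CAD 5,2%25", "ISIN": "CA639832AA25",
     "Country": "Canada", "Industry": "Banking"},
    {"Bond": "SIEMENSF1.50%03/20", "ISIN": "DE000A1G85B4",
     "Country": "Germany", "Industry": "Electrical Equipment"},
]
triples = lift(rows, subject_column="Bond")
```

Keeping the mapping as data rather than code is what allows several alternative mappings to be managed for the same dataset.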

Page 24: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

LINKING

• Goal: Connect individual datasets to a knowledge graph

• Identify related entities in different datasets and link them

  • Either entities describing the same real world object or another relation

[Diagram: the entities "NEDWBK CAD 5,2%25" and "Rating CAD 5,2%25" connected by hasRating; ratingScore "AAA"; fibo:legallyRecordedIn and fibo:industrySector edges (each shown twice) into the Country Ontology (node shown: EMEA) and the Industry Ontology]

Page 25: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

LINKAGE RULES

• Linking is based on domain-specific rules

• Specify the conditions that must hold true for two entities to be linked

Page 26: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

LEARNING LINKAGE RULES

Problem: Manually writing rules is time-consuming and requires expertise

Approach: Interactive machine learning algorithm for generating rules

• Generates a rule based on a number of user-confirmed link candidates.

• The learning algorithm actively selects the link candidates that yield a high information gain.

• The user does not need any knowledge of the characteristics of the dataset or any particular similarity computation techniques.
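
The slides do not specify how information gain is estimated. A common proxy in active learning is committee disagreement: ask the user about the candidate on which the current rules disagree most. The sketch below uses that substitute technique with a committee of simple threshold rules; the candidate pairs, similarity values and thresholds are made up.

```python
import math
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (pair id, precomputed similarity in [0, 1])
Rule = Callable[[float], bool]


def vote_entropy(similarity: float, committee: List[Rule]) -> float:
    """Entropy of the committee's link / no-link votes; 0 means full agreement."""
    share = sum(rule(similarity) for rule in committee) / len(committee)
    if share in (0.0, 1.0):
        return 0.0
    return -(share * math.log2(share) + (1 - share) * math.log2(1 - share))


def select_query(candidates: List[Candidate], committee: List[Rule]) -> Candidate:
    """Pick the candidate on which the current rules disagree most."""
    return max(candidates, key=lambda c: vote_entropy(c[1], committee))


# Committee of threshold rules standing in for a population of learned rules.
committee: List[Rule] = [lambda s, t=t: s >= t for t in (0.5, 0.7, 0.9, 0.95)]
candidates = [("a-b", 0.99), ("c-d", 0.8), ("e-f", 0.1)]

# The pair the user would be asked to confirm or decline next.
query = select_query(candidates, committee)
```

Pairs that every rule already accepts or rejects teach the learner little, so the user's effort is spent on the borderline cases.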

Page 28: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

VIEW GENERATION

• The user selects a set of lifted and linked datasets

Page 29: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DATA ACCESS

• Generate data flows based on Apache Spark

• The data flows utilize Resilient Distributed Datasets (RDDs)

• RDDs derive new data sets from existing data sets by applying a chain of transformations

• A derived data set can either

  • be recomputed on-the-fly

  • be persisted on stable storage

• Data flows can be executed efficiently on Hadoop clusters.

[Diagram: Hadoop Data Lake. Corporate Bonds, Internal Ratings and External Ratings each pass through a lifting step (Data Lifting 1, 2 and 3, each an Apache Spark RDD) and are combined by Data Linking (Apache Spark RDD) in eccenca Corporate Memory. The Data Consumer accesses the result via SQL, CSV, Excel or the Spark API.]
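
The difference between recomputing a derived data set on the fly and persisting it can be shown without a cluster. The class below is a toy stand-in written in plain Python, not the Spark API; it borrows the method names `map`, `filter`, `persist` and `collect` from Spark's RDD interface, and the `computations` counter and sample rows are additions for illustration.

```python
from typing import Callable, Iterable, List, Optional


class DerivedDataset:
    """Toy stand-in for an RDD: remembers how to derive its rows from a parent."""

    def __init__(self, compute: Callable[[], Iterable[dict]]):
        self._compute = compute
        self._persisted = False
        self._cache: Optional[List[dict]] = None
        self.computations = 0  # counts how often the chain was evaluated

    def map(self, fn: Callable[[dict], dict]) -> "DerivedDataset":
        return DerivedDataset(lambda: (fn(row) for row in self.collect()))

    def filter(self, predicate: Callable[[dict], bool]) -> "DerivedDataset":
        return DerivedDataset(lambda: (row for row in self.collect() if predicate(row)))

    def persist(self) -> "DerivedDataset":
        """Keep the result after the first evaluation instead of recomputing it."""
        self._persisted = True
        return self

    def collect(self) -> List[dict]:
        if self._cache is not None:
            return self._cache
        self.computations += 1
        rows = list(self._compute())
        if self._persisted:
            self._cache = rows
        return rows


bonds = DerivedDataset(lambda: [
    {"bond": "NEDWBK CAD 5,2%25", "country": "Canada"},
    {"bond": "SIEMENSF1.50%03/20", "country": "Germany"},
])
lifted = bonds.map(lambda row: {**row, "country": row["country"].upper()})
german = lifted.filter(lambda row: row["country"] == "GERMANY").persist()

first = german.collect()   # evaluates the whole chain once
second = german.collect()  # served from the persisted result
```

Without the call to `persist`, every `collect` would walk the full chain of transformations again, which is the trade-off between recomputation and storage named in the list above.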

Page 30: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

DEMO

Page 31: eccenca CorporateMemory - Semantically integrated Enterprise Data Lakes

Contact

Dr. Robert Isele

Tel: +49 151 17238616

email: [email protected]

eccenca

Command your Data!