WWW.LEDS-PROJEKT.DE
ECCENCA CORPORATE MEMORY
SEMANTICALLY INTEGRATED ENTERPRISE DATA LAKES
ROBERT ISELE
September 26, 2016
MOTIVATION
Enterprise Data Management Objective:
“Ensure all data is aligned to a common meaning in order to achieve automation in performing complex analytics and generating trusted reports.”
Source: 2015 Data Management Industry Benchmark, EDM Council
In 2015 only 7% of respondents claim to already be using shared and unambiguous definitions of data across the firm and have it accessible as operational metadata.
ARCHITECTURE
Management Accounting · Risk Management · Regulatory Reporting · Treasury · Marketing · Accounting
Corporate Memory
Inbound
Data Sources
Outbound and Consumption
Inbound Raw Data Store
Knowledge Graph for Meta Data, KPI Definition and Data Models
Frontend to Access Relationship and KPI Definition / Documentation
Frontend to Access (ad hoc) Reports
Outbound Data Delivery to Target Systems
Big Data / DWH-Infrastructure
Data Ingestion
• Files in the data lake (CSV, XML, Excel)
• (Relational) databases
Data Lake
• Emerging approach to handle large amounts of data
• Cost-effective storage
• Data is held in its native formats
Good: does not force an up-front integration of the ingested data sets.
Bad: retaining an overview of the disparate data silos in the lake without a coherent shared view is challenging.
Data Warehouses
• Existing infrastructure
• Typically relational databases
Metadata Layer
• Dataset metadata
• Ontologies
• Integration rules
Graphical User Interface
Customer Applications
INTEGRATION PROCESS
Dataset Management
• Catalog Datasets
• Catalog Ontologies
• Manage Metadata
Dataset Discovery
• Data Profiling
• Dataset Exploration
Dataset Integration
• Dataset Lifting
• Dataset Linking
• Data Quality Validation
Data Access
• Domain-Specific Consolidated Views
• Execution on Hadoop
DATASET MANAGEMENT
DATASET CATALOG
• Enables the user to explore and manage datasets in the data lake
• Files in the data lake (CSV, XML, Excel)
• Databases (Apache Hive or external databases)
MANAGING METADATA
• Exploring and editing dataset metadata
• Semantic content information, like textual descriptions, tags and related persons
• Technical information and parameters, like formats, data model and encoding
• Access information, like access path or URL, source system or API call
• Organizational provenance, like organizational units owning or maintaining the dataset
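The metadata categories above can be sketched as a simple record; the field names and values here are illustrative assumptions, not the actual Corporate Memory schema:

```python
# Illustrative dataset metadata record covering the four categories
# listed above (semantic, technical, access, provenance).
dataset_metadata = {
    "semantic": {
        "description": "Corporate bond master data",
        "tags": ["bonds", "fixed-income"],
        "related_persons": ["data steward"],
    },
    "technical": {
        "format": "CSV",
        "encoding": "UTF-8",
    },
    "access": {
        "path": "hdfs://datalake/inbound/bonds.csv",  # hypothetical path
        "source_system": "treasury",
    },
    "provenance": {
        "owning_unit": "Risk Management",
    },
}

def validate(metadata):
    """Check that all four metadata categories are present."""
    required = {"semantic", "technical", "access", "provenance"}
    return required <= metadata.keys()
```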
DATASET DISCOVERY
• Goal: Augment a dataset with data from related datasets
• Automatic discovery of datasets with overlapping information
• Explorative interface
• Discovery is based on two kinds of data:
• Business metadata
• Profiling summary
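A minimal sketch of how such metadata-based matching could work, scoring the overlap of detected column types and shared business tags; the scoring scheme and the example datasets are assumptions for illustration, not eccenca's actual discovery algorithm:

```python
# Score how related two datasets look, based purely on their metadata:
# the Jaccard overlap of profiled column types and of business tags.
def match_score(profile_a, profile_b, tags_a, tags_b):
    types_a, types_b = set(profile_a.values()), set(profile_b.values())
    type_overlap = len(types_a & types_b) / max(len(types_a | types_b), 1)
    tag_overlap = (len(set(tags_a) & set(tags_b))
                   / max(len(set(tags_a) | set(tags_b)), 1))
    return 0.5 * type_overlap + 0.5 * tag_overlap

# Hypothetical profiling summaries: column name -> detected type
bonds = {"ISIN": "identifier", "Country": "country", "Industry": "industry"}
ratings = {"ISIN": "identifier", "Rating": "rating"}
score = match_score(bonds, ratings, ["bonds"], ["bonds", "ratings"])
```

Datasets whose score exceeds some threshold would then be proposed in the discovery view.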
DISCOVERY VIEW
• Datasets are matched based on their metadata (profiling + business data)
DATASET PROFILING
• Datasets often contain implicit and explicit schema information
• Column names, data formats, enumerated values, etc.
• Example: column contains formatted dates
• Idea: Extract a dataset summary
• For each column / property the summary contains:
1. Data type (e.g., number, date, industry classification)
2. Data format (e.g., date format)
3. Data statistics (e.g., range, distribution, most frequent values)
• Materialized as RDF with UI view
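A sketch of per-column profiling along these lines: infer a data type, a format for dates, and simple value statistics. The detection heuristics are illustrative assumptions, not the product's actual profiler:

```python
from collections import Counter
import re

# Assumed date pattern: two digits, two digits, four digits (DD-MM-YYYY).
DATE_RE = re.compile(r"\d{2}-\d{2}-\d{4}$")

def profile_column(values):
    """Summarize one column: type, format, and value statistics."""
    if all(DATE_RE.match(v) for v in values):
        dtype, fmt = "date", "DD-MM-YYYY"
    elif all(v.replace(".", "", 1).isdigit() for v in values):
        dtype, fmt = "number", None
    else:
        dtype, fmt = "string", None
    freq = Counter(values)
    return {
        "type": dtype,
        "format": fmt,
        "distinct": len(freq),
        "most_frequent": freq.most_common(1)[0][0],
    }
```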
DETECTING DATA TYPES
• Detecting common datatypes as well as user-defined types
• Common datatypes• Numbers
• Dates / Times
• Geographic locations (geo-coordinates, states, countries)
• User-defined data types can be integrated by adding an ontology / taxonomy
• Usually a SKOS taxonomy
• Managed as another dataset in the dataset management
• Example: industry taxonomy
• Standard taxonomy (NACE, SIC, NAICS) or company-specific
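Taxonomy-based type detection can be sketched as matching column values against the labels of a SKOS-like taxonomy; the taxonomy excerpt and threshold below are made up for illustration:

```python
# A tiny made-up excerpt standing in for an industry taxonomy
# (a real one would be loaded from a SKOS dataset in the catalog).
INDUSTRY_TAXONOMY = {"Banking", "Utilities", "Electrical Equipment"}

def detect_taxonomy_type(values, taxonomy, threshold=0.8):
    """Return True if enough column values are terms of the taxonomy."""
    hits = sum(1 for v in values if v in taxonomy)
    return hits / len(values) >= threshold
```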
FORMATS AND STATISTICS
• For some types, the data format is detected
• Example: dates are formatted as DD-MM-YYYY
• Two functions are generated:
1. Parser that is able to read the detected representation
2. Normalizer that converts the parsed values into a configurable, organization-wide target representation
• Statistics summarize the values:
• Value range and distribution
• Most frequent values
• Data selectivity
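The two generated functions can be sketched with Python's datetime, assuming DD-MM-YYYY was the detected representation and ISO 8601 is the organization-wide target representation:

```python
from datetime import datetime

def make_parser(fmt="%d-%m-%Y"):
    """Parser for the detected source representation."""
    return lambda s: datetime.strptime(s, fmt)

def make_normalizer(target="%Y-%m-%d"):
    """Normalizer to the configurable target representation."""
    return lambda d: d.strftime(target)

parse = make_parser()
normalize = make_normalizer()
iso = normalize(parse("26-09-2016"))  # "2016-09-26"
```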
DATA INTEGRATION
• The integration process is driven by a set of rules:
• Lifting rules map the source datasets to an ontology
• Linking Rules connect different datasets to a knowledge graph
• Rules are operator trees, consisting of four types of operators:
• Data Access Operators
• Transformation Operators
• Similarity Operators
• Aggregation Operators
• Rules can be learned using genetic programming algorithms
• Rules are human-understandable and can be edited manually
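A minimal sketch of a rule as an operator tree combining the four operator kinds; the class names and scoring convention are illustrative assumptions, not the product's API:

```python
class Input:                      # data access operator: read a path
    def __init__(self, path): self.path = path
    def evaluate(self, entity): return entity[self.path]

class Lowercase:                  # transformation operator
    def __init__(self, child): self.child = child
    def evaluate(self, entity): return self.child.evaluate(entity).lower()

class Equality:                   # similarity operator: 1.0 if equal, else 0.0
    def __init__(self, a, b): self.a, self.b = a, b
    def evaluate(self, pair):
        src, tgt = pair
        return 1.0 if self.a.evaluate(src) == self.b.evaluate(tgt) else 0.0

class Minimum:                    # aggregation operator over child scores
    def __init__(self, children): self.children = children
    def evaluate(self, pair):
        return min(c.evaluate(pair) for c in self.children)

# A rule comparing two records on their case-normalized ISIN
rule = Minimum([Equality(Lowercase(Input("isin")), Lowercase(Input("isin")))])
score = rule.evaluate(({"isin": "CA639832AA25"}, {"isin": "ca639832aa25"}))
```

Because the rule is an explicit tree of named operators, it stays human-readable and editable, which is what makes the learned rules inspectable.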
DATASET LIFTING
• Objective: Map the datasets in the data lake to a consistent vocabulary.
• A lifting rule consists of a number of mappings
• Each mapping assigns a term in the original data set (such as a column for tabular data) to a term in the target ontology (such as a property provided by an ontology).
• Multiple mappings for each dataset can be managed to allow different views on the same data.
• Initial mappings are generated automatically from the profiling results, which the user can then refine.
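A lifting rule can be sketched as a column-to-property mapping that emits simplified subject/property/value triples; the FIBO property names follow the lifting example on the next slide, everything else is illustrative:

```python
# Mapping from source columns to target ontology properties
lifting_rule = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}

def lift(row, rule, subject_column="Bond"):
    """Apply a lifting rule to one tabular row, yielding triples."""
    subject = row[subject_column]
    return [(subject, prop, row[col]) for col, prop in rule.items() if col in row]

triples = lift(
    {"Bond": "NEDWBK CAD 5,2%25", "ISIN": "CA639832AA25",
     "Country": "Canada", "Industry": "Banking"},
    lifting_rule,
)
```

Keeping several such rules per dataset gives the different views on the same data mentioned above.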
LIFTING EXAMPLE
Bond | ISIN | Country | Industry
NEDWBK CAD 5,2%25 | CA639832AA25 | Canada | Banking
SIEMENSF1.50%03/20 | DE000A1G85B4 | Germany | Electrical Equipment
Electricite de France (EDF), 6,5% 26jan2019 | USF2893TAB29 | France | Utilities

[Diagram: the lifted graph for "NEDWBK CAD 5,2%25": fibo:hasSecurityIdentifier points to "CA639832AA25", fibo:legallyRecordedIn points into the Country Ontology (Germany, France, grouped under EMEA), and fibo:industrySector points into the Industry Ontology (Banking, Utilities).]
LINKING
• Goal: Connect individual datasets to a knowledge graph
• Identify related entities in different datasets and link them• Either entities describing the same real world object or another relation
[Diagram: the linked graph. The bond "NEDWBK CAD 5,2%25" is connected via hasRating to the entity "Rating CAD 5,2%25", which carries the ratingScore "AAA"; the bond remains connected to the Country Ontology (EMEA) via fibo:legallyRecordedIn and to the Industry Ontology via fibo:industrySector.]
LINKAGE RULES
• Linking is based on domain-specific rules
• Specify the conditions that must hold true for two entities to be linked
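Such a rule can be sketched as a predicate over two entities, here matching on normalized security identifiers; the normalization and field name are illustrative assumptions:

```python
def normalized_isin(record):
    """Case- and whitespace-normalize the identifier used for matching."""
    return record.get("isin", "").replace(" ", "").upper()

def link(source, target):
    """Condition that must hold for two entities to be linked."""
    s, t = normalized_isin(source), normalized_isin(target)
    return bool(s) and s == t

linked = link({"isin": "ca639832aa25"}, {"isin": "CA639832AA25"})
```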
LEARNING LINKAGE RULES
Problem: Manually writing rules is time-consuming and requires expertise
Approach: Interactive machine learning algorithm for generating rules
• Generates a rule based on a number of user-confirmed link candidates.
• Link candidates are actively selected by the learning algorithm to include link candidates that yield a high information gain.
• The user does not need any knowledge of the characteristics of the dataset or any particular similarity computation techniques.
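The interaction loop can be sketched as uncertainty-based candidate selection plus threshold re-fitting; this simplified sketch stands in for the genetic-programming learner, and all scores are illustrative:

```python
def most_uncertain(candidates, score, threshold):
    """Pick the candidate closest to the decision boundary
    (highest expected information gain from a user verdict)."""
    return min(candidates, key=lambda c: abs(score(c) - threshold))

def refit_threshold(confirmed, rejected, score):
    """Place the threshold midway between the lowest confirmed
    and the highest rejected score."""
    lo = min(score(c) for c in confirmed)
    hi = max(score(r) for r in rejected)
    return (lo + hi) / 2

score = lambda pair: pair["similarity"]       # stand-in for a learned rule
candidates = [{"similarity": s} for s in (0.1, 0.48, 0.9)]
pick = most_uncertain(candidates, score, threshold=0.5)
threshold = refit_threshold(
    confirmed=[{"similarity": 0.9}],
    rejected=[{"similarity": 0.1}],
    score=score,
)
```

In the real algorithm the user's verdicts would also drive the evolution of the rule tree itself, not just its threshold.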
VIEW GENERATION
• The user selects a set of lifted and linked datasets
[Screenshot: a consolidated view generated over the Hadoop data lake.]
DATA ACCESS
• Generate data flows based on Apache Spark
• The data flows utilize Resilient Distributed Datasets (RDDs)
• RDDs derive new data sets from existing data sets by applying a chain of transformations
• A derived data set can either be recomputed on the fly or persisted on stable storage
• Data flows can be executed efficiently on Hadoop clusters.
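The RDD idea, lazily derived datasets with optional persistence, can be sketched in plain Python; this illustrates the concept only and is not Apache Spark's actual API:

```python
class LazyDataset:
    """A dataset that records its derivation and computes on demand."""
    def __init__(self, data=None, parent=None, op=None):
        self.data, self.parent, self.op = data, parent, op
        self._cache = None

    def map(self, fn):                    # derive via a transformation
        return LazyDataset(parent=self, op=("map", fn))

    def filter(self, fn):                 # derive via a selection
        return LazyDataset(parent=self, op=("filter", fn))

    def persist(self):                    # materialize on "stable storage"
        self._cache = self.collect()
        return self

    def collect(self):                    # compute, reusing any cache
        if self._cache is not None:
            return self._cache
        if self.parent is None:
            return list(self.data)
        rows = self.parent.collect()
        kind, fn = self.op
        return ([fn(x) for x in rows] if kind == "map"
                else [x for x in rows if fn(x)])

lifted = LazyDataset(["ca639832aa25", "de000a1g85b4"]).map(str.upper).persist()
result = lifted.filter(lambda isin: isin.startswith("DE")).collect()
```

The persisted intermediate result is reused by every downstream derivation, which is the trade-off Spark exposes between recomputation and storage.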
[Diagram: example data flow. Corporate Bonds, Internal Ratings and External Ratings are each lifted by a dedicated Apache Spark RDD (Data Lifting 1, 2 and 3), fed into Data Linking (Apache Spark RDD), and stored in eccenca Corporate Memory; data consumers access the result via SQL, CSV / Excel, or the Spark API.]
DEMO