Leveraging Data Virtualization - Axians

Leveraging Data Virtualization

To maximize advanced analytics & machine learning potential

Vincent Fages-Gouyou, Director of Product Management EMEA

Agenda

1. What are Advanced Analytics?

2. The Data Challenge

3. The Rise of Logical Data Architectures

4. Tackling the Data Pipeline Problem

5. Real-time Machine Learning with Data Virtualization

6. Key Takeaways

7. Q&A

8. Next Steps

Analytics Value Escalator

The Analytics Chasm

The Key Ingredient for Advanced Analytics is Data

Input data for a data science project may come in a variety of systems and formats:

• Files (CSV, logs, Parquet)

• Relational databases (EDW, operational systems)

• NoSQL systems (key-value pairs, document stores, time series, etc.)

• SaaS APIs (Salesforce, Marketo, ServiceNow, Facebook, Twitter, etc.)

In addition, the Big Data community has embraced data science as one of its pillars, for example with Spark and SparkML and with architectural patterns like the Data Lake.
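To make this variety concrete, the following minimal Python sketch hand-codes the preparation needed to combine three of the source types listed above: a CSV file, a relational table, and a JSON payload of the kind a SaaS API might return. Everything in it is a hypothetical toy: the column names, the `stations` table, the station names and the values are assumptions made for illustration, and SQLite stands in for a real relational database.

```python
import csv
import io
import json
import sqlite3

# Hypothetical toy inputs standing in for three kinds of sources.
CSV_TEXT = "station_id,trips\n1,120\n2,80\n"  # file export
API_JSON = (
    '{"observations": ['
    '{"station_id": 1, "temp_c": 21.5}, '
    '{"station_id": 2, "temp_c": 19.0}]}'
)  # SaaS-style API response


def load_csv(text):
    """Parse CSV text into {station_id: trips}."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(r["station_id"]): int(r["trips"]) for r in reader}


def load_relational():
    """Read station names from an in-memory SQLite table (stand-in for an RDBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stations (station_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO stations VALUES (?, ?)",
        [(1, "Station A"), (2, "Station B")],  # made-up names
    )
    rows = conn.execute("SELECT station_id, name FROM stations").fetchall()
    conn.close()
    return dict(rows)


def load_api(text):
    """Parse a JSON document into {station_id: temp_c}."""
    return {o["station_id"]: o["temp_c"] for o in json.loads(text)["observations"]}


def combine():
    """Join the three sources on station_id, one parser per technology."""
    trips, names, temps = load_csv(CSV_TEXT), load_relational(), load_api(API_JSON)
    return [
        {"station_id": sid, "name": names[sid], "trips": trips[sid], "temp_c": temps[sid]}
        for sid in sorted(trips)
    ]


if __name__ == "__main__":
    for row in combine():
        print(row)
```

Each source needs its own loader and its own join logic, which is the kind of per-project plumbing the rest of the deck argues should be shared rather than rewritten.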

Typical Data Science Workflow

A typical workflow for a data scientist is:

1. Gather requirements for the business problem

2. Identify relevant Data

3. Cleanse Data into a useful format

4. Analyze Data

5. Prepare input for algorithms

6. Execute Data Science algorithms (ML, AI, etc.)

Iterate 1 to 6

7. Visualize and Share

Typical Data Science Workflow

• Finding and preparing the data: 80%

• Analysis: 10%

• Visualizing data: 10%

Data Access – A Wild West Expedition?

Finding the right data

Getting access

Understand heterogeneous technologies (noSQL, REST APIs, etc.)

Transforming into a useable format

Combining multiple sources

Profiling / sanitizing data

Prepare for ML / AI algorithms

Share data, processes and results

Photo by Jasper van der Meij on Unsplash

Data Lakes – The Solution?

High initial investment with unclear value

Replication, replication, replication… with limited value added

Data scientists alone can’t manage it efficiently, without additional support

Large processing capabilities

Agility

Lower cost of storage

Without governance & management, it can become a “Data Swamp”

Photo by Aaron Burden on Unsplash

Data Virtualization

Shared Data Infrastructures

Security, access control & audit

A single Data Delivery platform for Data Science, Analytics and APIs

Maximize value out of your existing technologies (RDBMS, Hadoop, Cloud, etc.)

Optimized Investments

Shorten Time-to-Data

Photo by Tiago Gerken on Unsplash

Rise of Logical Architectures

The Evolution of Analytical Architectures

Source: “Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”, Gartner, April 2018

Logical Data Warehouse: the Path to the Future

Source: “Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”, Henry Cook, Gartner, April 2018

Data Scientist Workflow Steps

1. Identify useful data

2. Modify data into a useful format

3. Analyze data

4. Prepare for ML algorithm

5. Execute data science algorithms (ML, AI, etc.)

6. Share with business users

Security

Governance, Metadata Management, Data Mart

Data Access

Data Virtualization Data Services

Agile Information Architecture

[Diagram]

Consumers: Visualization, ML / AI, Data Science, Data Quality

Capabilities: Federation, Transformation, Abstraction, Data Service

Dynamic Query Optimization: Cost-Based Optimizer, Query Rewriting, Caching, MPP

Security & Governance

Lifecycle Management

Data Catalog: Discover, Collaborate, Query, Categorize

Data Sources: Data Warehouse, noSQL, RDBMS

Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018

“When designed properly, Data Virtualization can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources.

Due to its many benefits, Data Virtualization is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture”

Benefits of a Virtual Data Layer

• A Virtual Layer improves decision making and shortens development cycles

    • Surfaces all company data from multiple repositories without the need to replicate all data into a lake

    • Eliminates data silos: allows for on-demand combination of data from multiple sources

• A Virtual Layer broadens usage of data

    • Improves governance and metadata management to avoid “data swamps”

    • Decouples data source technology: access is normalized via SQL or web services

    • Allows controlled access to the data with fine-grained security controls

• A Virtual Layer offers performant access

    • Leverages the processing power of the existing sources, controlled by Denodo’s optimizer

    • Processes data for sources with no processing capabilities of their own (e.g. files)

    • Provides a caching and ingestion engine to persist data when needed
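As a rough illustration of "access normalized via SQL" and on-demand combination without replication, the sketch below uses SQLite's `ATTACH DATABASE` so that one connection can join tables living in two separate databases with a single query. This is only an analogy under stated assumptions: SQLite stands in for a virtual layer, it is not how the Denodo Platform is implemented, and the table names, column names and values are hypothetical.

```python
import sqlite3


def build_virtual_layer():
    """Return one connection that exposes two separate databases through SQL."""
    conn = sqlite3.connect(":memory:")                     # source 1: trip data
    conn.execute("ATTACH DATABASE ':memory:' AS weather")  # source 2: a separate database
    conn.execute("CREATE TABLE main.trips (day TEXT, trips INTEGER)")
    conn.execute("CREATE TABLE weather.daily (day TEXT, rain_mm REAL)")
    # Made-up toy rows, for illustration only.
    conn.executemany(
        "INSERT INTO main.trips VALUES (?, ?)",
        [("2017-06-01", 900), ("2017-06-02", 400)],
    )
    conn.executemany(
        "INSERT INTO weather.daily VALUES (?, ?)",
        [("2017-06-01", 0.0), ("2017-06-02", 12.5)],
    )
    return conn


def trips_with_weather(conn):
    """Combine both sources on demand; no data is copied between them."""
    sql = """
        SELECT t.day, t.trips, w.rain_mm
        FROM main.trips AS t
        JOIN weather.daily AS w ON w.day = t.day
        ORDER BY t.day
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in trips_with_weather(build_virtual_layer()):
        print(row)
```

The consumer writes one SQL statement and does not need to know that the two tables live in different stores, which is the decoupling the bullets above describe.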

Demonstration

Accelerating the Machine Learning Data Pipeline with Data Virtualization

https://flic.kr/p/x8HgrF

Can we predict the usage of the NYC bike system based on data from previous years?

Data Sources – Citibike

https://flic.kr/p/CYT7SS

There are external factors to consider.

Which ones?

Data Sources – NWS Weather Data

What We’re Going To Do…

1. Connect to data and have a look

2. Format the data (prep it) so that we can look for significant factors

• e.g. bike trips on different days of week, different months of year, etc.

3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm

4. Using Python, read the 2017 data and run it through our ML algorithm for training

5. Read the 2018 data, test the algorithm

6. Save the results and load them into the Denodo Platform
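Steps 2 to 5 above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the data is synthetic (generated by the hypothetical `synthetic_year` helper, not real Citi Bike figures), and a per-weekday average stands in for the ML algorithm, which the slides do not name. Step 1 (connecting to the data) and step 6 (saving results into the Denodo Platform) are left out.

```python
from collections import defaultdict
from datetime import date, timedelta


def synthetic_year(year):
    """Hypothetical toy data: one (date, trips) pair per day of the year."""
    day, rows = date(year, 1, 1), []
    while day.year == year:
        trips = 1000 if day.weekday() < 5 else 600  # made-up weekday/weekend levels
        rows.append((day, trips))
        day += timedelta(days=1)
    return rows


def features(day):
    """Step 2: derive candidate factors from the raw date."""
    return {"weekday": day.weekday(), "month": day.month}


def train(rows):
    """Step 4: fit a baseline model, the mean number of trips per weekday."""
    sums, counts = defaultdict(float), defaultdict(int)
    for day, trips in rows:
        key = features(day)["weekday"]
        sums[key] += trips
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def mean_abs_error(model, rows):
    """Step 5: test the model on a year it has not seen."""
    errors = [abs(model[features(day)["weekday"]] - trips) for day, trips in rows]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    model = train(synthetic_year(2017))
    print("MAE on 2018:", mean_abs_error(model, synthetic_year(2018)))
```

Because the synthetic data follows the same weekday pattern in both years, the baseline fits it perfectly; real trip data would also call for the month and weather attributes mentioned earlier.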

A Modern Data Virtualization Architecture

DATA CONSUMERS

• SQL Queries (JDBC, ODBC, ADO.NET)

• Web Services (SOAP, REST, OData)

• Web-based catalog & search

• Secure delivery (SSL/TLS)

DATA VIRTUALIZATION

• Execution Engine & Optimizer

• Metadata Repository

• MPP Processing

• Relational Cache

• Corporate Security

• Monitoring & Auditing

DISPARATE DATA SOURCES

Demo

McCormick Spice

McCormick Spice (Cont’d)

[Diagram]

Data Services (Data Virtualization)

API Management and Runtime

Semantics & Discovery

Cross-cutting: Governance, Security

System 1, System n, External API $

McCormick Spice (Cont’d)

Benefits

✓ Timely information

✓ No replication

✓ No need to validate information

✓ Better staging for learning

Approach

1. Model requests specific modifications or full information

2. Model trains incrementally or fully

[Diagram: (1) Request, (2) Collect, (3) Receive, (4) Train. Components: Algorithms, Enterprise Data Services, Backend Systems (real-time), External Systems (real-time)]
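The difference between incremental and full training can be illustrated with a small sketch. It assumes a simple least-squares line kept as running sums, so that new observations (the "specific modifications") can be folded in without re-reading earlier data. The class name `IncrementalLinearModel` and the sample points are hypothetical; the slides do not say which algorithms McCormick uses.

```python
class IncrementalLinearModel:
    """Least-squares line y = a + b*x stored as running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, batch):
        """Fold a batch of (x, y) pairs into the running sums."""
        for x, y in batch:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        return self

    def coefficients(self):
        """Return (a, b) for the current sums."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a, b


if __name__ == "__main__":
    history = [(0, 1), (1, 3)]  # toy points on y = 1 + 2x
    changes = [(2, 5), (3, 7)]  # only the new observations

    incremental = IncrementalLinearModel().update(history).update(changes)
    full = IncrementalLinearModel().update(history + changes)
    print(incremental.coefficients(), full.coefficients())
```

Both paths end with the same coefficients; the incremental one only needs the changed rows, while the full one needs the complete data set on every run.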

Key Takeaways

✓ The Denodo Platform makes all kinds of data – from a variety of data sources – readily available to your data analysts and data scientists

✓ Data virtualization shortens the ‘data wrangling’ phases of analytics/ML projects, avoiding the need to write ‘data prep’ scripts in Python, R, etc.

✓ It’s easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter

✓ The Denodo Platform enables centralization of data access and sharing across all data processing stages

Next Steps

Access Denodo Platform in the Cloud!

Take a Test Drive today!

GET STARTED TODAY

www.denodo.com/TestDrive

Q&A

Thanks!

www.denodo.com info@denodo.com

© Copyright Denodo Technologies. All rights reserved

Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.
