Leveraging Data Virtualization - Axians · data sources – readily available to your data analysts and data scientists ü Data virtualization shortens the ‘data wrangling’ phases


Leveraging Data Virtualization

To maximize advanced analytics & machine learning potential

Vincent Fages-Gouyou, Director of Product Management EMEA

Agenda

1. What are Advanced Analytics?

2. The Data Challenge

3. The Rise of Logical Data Architectures

4. Tackling the Data Pipeline Problem

5. Real-time Machine Learning with Data Virtualization

6. Key Takeaways

7. Q&A

8. Next Steps

Analytics Value Escalator

The Analytics Chasm

The Key Ingredient for Advanced Analytics is Data

Input data for a data science project may come in a variety of systems and formats:

• Files (CSV, logs, Parquet)

• Relational databases (EDW, operational systems)

• NoSQL systems (key-value pairs, document stores, time series, etc.)

• SaaS APIs (Salesforce, Marketo, ServiceNow, Facebook, Twitter, etc.)

In addition, the Big Data community has embraced data science as one of its pillars, for example through Spark and SparkML and through architectural patterns such as the Data Lake.
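As a rough illustration of what this variety means in practice, the sketch below reads one attribute from each of three kinds of source (a CSV file, a relational table, a JSON payload of the kind a SaaS API might return) and combines them by hand. Everything in it is an assumption made for illustration: the sample values, the `station_id` key and the helper names are hypothetical and do not come from the presentation.

```python
import csv
import io
import json
import sqlite3

# All sample values below are made-up placeholders, not data from the talk.
CSV_TEXT = "station_id,trips\nA,120\nB,80\n"
JSON_TEXT = '[{"station_id": "A", "temp_c": 21.5}, {"station_id": "B", "temp_c": 19.0}]'


def load_csv(text):
    """File source: parse CSV text into a dict keyed by station_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["station_id"]: {"trips": int(row["trips"])} for row in reader}


def load_relational():
    """Relational source: an in-memory SQLite table stands in for an EDW."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stations (station_id TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO stations VALUES (?, ?)",
        [("A", "Broadway"), ("B", "Canal St")],
    )
    rows = conn.execute("SELECT station_id, name FROM stations").fetchall()
    conn.close()
    return {sid: {"name": name} for sid, name in rows}


def load_json(text):
    """Document/API source: parse a JSON payload such as a REST response."""
    return {doc["station_id"]: {"temp_c": doc["temp_c"]} for doc in json.loads(text)}


def combine(*sources):
    """Merge the per-source attributes into one record per station_id."""
    combined = {}
    for source in sources:
        for key, attrs in source.items():
            combined.setdefault(key, {"station_id": key}).update(attrs)
    return [combined[key] for key in sorted(combined)]


records = combine(load_csv(CSV_TEXT), load_relational(), load_json(JSON_TEXT))
for record in records:
    print(record)
```

Each source needs its own access code before any analysis can start, which is the overhead the rest of the deck is concerned with.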

Typical Data Science Workflow

A typical workflow for a data scientist is:

1. Gather requirements for the business problem

2. Identify relevant Data

3. Cleanse Data into a useful format

4. Analyze Data

5. Prepare input for algorithms

6. Execute Data Science algorithms (ML, AI, etc.)

Iterate 1 to 6

7. Visualize and Share

Typical Data Science Workflow

• Finding and preparing the data: 80%

• Analysis: 10%

• Visualizing data: 10%

Data Access – A Wild West Expedition?

Finding the right data

Getting access

Understand heterogeneous technologies (noSQL, REST APIs, etc.)

Transforming into a useable format

Combining multiple sources

Profiling / sanitizing data

Prepare for ML / AI algorithms

Share data, processes and results

Photo by Jasper van der Meij on Unsplash

Data Lakes – The Solution?

High initial investment with unclear value

Replication, replication, replication… with limited value added

Data scientists alone can’t manage it efficiently, without additional support

Large processing capabilities

Agility

Lower cost of storage

Without governance & management, it can become a “Data Swamp”

Photo by Aaron Burden on Unsplash

Data Virtualization

Shared (mutualized) Data Infrastructures

Security, access control & audit

A single Data Delivery platform for Data Science, Analytics and APIs

Maximize value out of your existing technologies (RDBMS, Hadoop, Cloud, etc.)

Optimized Investments

Shorten Time-to-Data

Photo by Tiago Gerken on Unsplash

Rise of Logical Architectures

The evolution of Analytical Architectures: Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, Gartner April 2018

Logical Data Warehouse: the Path to the Future

“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018

Data Scientist Workflow Steps

1. Identify useful data

2. Modify data into a useful format

3. Analyze data

4. Prepare for ML algorithm

5. Execute data science algorithms (ML, AI, etc.)

6. Share with business users

Agile Information Architecture

[Architecture diagram; recoverable labels listed by group]

• Consumers: Visualization, ML / AI, Data Science, Data Quality

• Cross-cutting: Security; Governance, Metadata Management, Data Mart; Data Access

• Data Virtualization / Data Services: Federation, Transformation, Abstraction, Data Service

• Dynamic Query Optimization: Cost Based Optimizer, Query Rewriting, Caching, MPP

• Security & Governance, Lifecycle Management

• Data Catalog: Discover, Collaborate, Query, Categorize

• Data Sources: Data Warehouse, noSQL, RDBMS

Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018

“When designed properly, Data Virtualization can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources.

Due to its many benefits, Data Virtualization is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture”

Benefits of a Virtual Data Layer

• A Virtual Layer improves decision making and shortens development cycles

  • Surfaces all company data from multiple repositories without the need to replicate all data into a lake

  • Eliminates data silos: allows for on-demand combination of data from multiple sources

• A Virtual Layer broadens usage of data

  • Improves governance and metadata management to avoid “data swamps”

  • Decouples consumers from data source technology: access is normalized via SQL or web services

  • Allows controlled access to the data with fine-grained security controls

• A Virtual Layer offers performant access

  • Leverages the processing power of the existing sources, controlled by Denodo’s optimizer

  • Processes data for sources with no processing capabilities of their own (e.g. files)

  • Caching and ingestion engine to persist data when needed
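The idea of combining sources on demand without replicating them can be sketched with nothing more than SQLite. In the example below, two separate in-memory databases stand in for two independent source systems, and a temporary view joins them at query time; only the view definition is stored, never a copy of the rows. This is an analogy built on stated assumptions, not Denodo's API: the table names, the view name `trips_with_weather` and all values are hypothetical.

```python
import sqlite3

# "main" plays the role of one source system (trip counts).
conn = sqlite3.connect(":memory:")
# A second, independent in-memory database plays a different source (weather).
conn.execute("ATTACH DATABASE ':memory:' AS weather")

conn.execute("CREATE TABLE main.trips (day TEXT, trips INTEGER)")
conn.execute("CREATE TABLE weather.daily (day TEXT, temp_c REAL)")

# Placeholder values invented for this sketch.
conn.executemany(
    "INSERT INTO main.trips VALUES (?, ?)",
    [("2018-01-01", 50), ("2018-01-02", 70)],
)
conn.executemany(
    "INSERT INTO weather.daily VALUES (?, ?)",
    [("2018-01-01", 5.0), ("2018-01-02", 12.0)],
)

# The "virtual" layer: a TEMP view may reference tables in several attached
# databases. It stores the query text only, so no data is duplicated.
conn.execute(
    """
    CREATE TEMP VIEW trips_with_weather AS
    SELECT t.day, t.trips, w.temp_c
    FROM main.trips AS t
    JOIN weather.daily AS w ON w.day = t.day
    """
)

QUERY = (
    "SELECT day, trips, temp_c FROM trips_with_weather "
    "WHERE temp_c > 10 ORDER BY day"
)
warm_days = conn.execute(QUERY).fetchall()
print(warm_days)

# A change in a source is visible through the view immediately,
# because the join is evaluated at query time.
conn.execute("UPDATE weather.daily SET temp_c = 11.0 WHERE day = '2018-01-01'")
warm_days_after_update = conn.execute(QUERY).fetchall()
print(warm_days_after_update)
```

Consumers query a single logical object with plain SQL and never need to know that the rows live in two different systems.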

Demonstration

Accelerating the Machine Learning Data Pipeline with Data Virtualization

https://flic.kr/p/x8HgrF

Can we predict the usage of the NYC bike system based on data from previous years?

Data Sources – Citibike

https://flic.kr/p/CYT7SS

There are external factors to consider. Which ones?

Data Sources – NWS Weather Data

What We’re Going To Do…

1. Connect to data and have a look

2. Format the data (prep it) so that we can look for significant factors

  • e.g. bike trips on different days of week, different months of year, etc.

3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm

4. Using Python, read the 2017 data and run it through our ML algorithm for training

5. Read the 2018 data, test the algorithm

6. Save the results and load them into the Denodo Platform
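A minimal, standard-library sketch of steps 2 to 5 might look like the following. It is not the code used in the demo: the slides do not name the ML algorithm, so a one-feature least-squares line is swapped in as a stand-in, and the rows are synthetic placeholders (trips are constructed as 100 + 20 × temperature) rather than Citi Bike or NWS figures. All function and field names are hypothetical.

```python
from datetime import date

# Synthetic placeholder rows: (date, average temperature in °C, trips).
# NOT real Citi Bike / NWS data; trips is constructed as 100 + 20 * temp_c.
ROWS = [
    (date(2017, 1, 15), 0.0, 100.0),
    (date(2017, 4, 15), 10.0, 300.0),
    (date(2017, 7, 15), 25.0, 600.0),
    (date(2017, 10, 15), 15.0, 400.0),
    (date(2018, 1, 15), 5.0, 200.0),
    (date(2018, 7, 15), 20.0, 500.0),
]


def add_calendar_features(row):
    """Step 2: derive candidate factors such as month and day of week."""
    day, temp_c, trips = row
    return {
        "year": day.year,
        "month": day.month,
        "weekday": day.weekday(),  # 0 = Monday ... 6 = Sunday
        "temp_c": temp_c,
        "trips": trips,
    }


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(intercept, slope, xs, ys):
    """Average absolute gap between predicted and observed trips."""
    return sum(abs(intercept + slope * x - y) for x, y in zip(xs, ys)) / len(xs)


features = [add_calendar_features(row) for row in ROWS]

# Steps 3-4: keep the chosen attribute and train on the 2017 rows.
train = [f for f in features if f["year"] == 2017]
intercept, slope = fit_line(
    [f["temp_c"] for f in train], [f["trips"] for f in train]
)

# Step 5: test on the 2018 rows.
test = [f for f in features if f["year"] == 2018]
mae = mean_absolute_error(
    intercept, slope, [f["temp_c"] for f in test], [f["trips"] for f in test]
)
print(f"intercept={intercept:.1f} slope={slope:.1f} test MAE={mae:.1f}")
```

Splitting by year rather than at random keeps the test honest for a forecasting question: the model is evaluated only on a period it has never seen.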

A Modern Data Virtualization Architecture

[Architecture diagram; recoverable labels listed by layer]

DATA CONSUMERS

• SQL Queries (JDBC, ODBC, ADO.NET)

• Web Services (SOAP, REST, OData)

• Web-based catalog & search

• Secure delivery (SSL/TLS)

DATA VIRTUALIZATION

• Execution Engine & Optimizer

• Metadata Repository

• MPP Processing

• Relational Cache

• Corporate Security

• Monitoring & Auditing

DISPARATE DATA SOURCES

Demo

McCormick Spice

McCormick Spice (Cont’d)

[Layer diagram; recoverable labels, top to bottom]

• Semantics & Discovery

• API Management and Runtime

• Data Services (Data Virtualization)

• Sources: System 1 … System n, External API $

• Governance and Security run alongside the layers

McCormick Spice (Cont’d)

Benefits

✓ Timely information

✓ No replication

✓ No need to validate information

✓ Better staging for learning

Approach

1. Model requests specific modifications or full information

2. Model trains incrementally or fully

[Flow diagram; recoverable labels: Algorithms, Enterprise Data Services, Backend Systems (Real-Time), External Systems (Real-Time); steps: 1 Request, 2 Collect, 3 Receive, 4 Train]
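The difference between the two training modes in the approach can be sketched with a deliberately trivial model. The class below is hypothetical and is not McCormick's system: it "predicts" the mean of the values it has seen, which makes it easy to show that folding in only the newly received modifications gives the same state as retraining on the full information.

```python
class RunningMeanModel:
    """Toy model: predicts the mean of all target values seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def train_full(self, values):
        """Full training: discard the state and refit on the complete data."""
        self.count = 0
        self.mean = 0.0
        self.train_incremental(values)

    def train_incremental(self, new_values):
        """Incremental training: fold only the new values into the state."""
        for value in new_values:
            self.count += 1
            self.mean += (value - self.mean) / self.count

    def predict(self):
        return self.mean


# Placeholder values invented for this sketch.
history = [10.0, 20.0, 30.0]   # "full information" requested once
modifications = [40.0]         # "specific modifications" received later

incremental_model = RunningMeanModel()
incremental_model.train_full(history)
incremental_model.train_incremental(modifications)

full_model = RunningMeanModel()
full_model.train_full(history + modifications)

print(incremental_model.predict(), full_model.predict())
```

Requesting only the modifications keeps each training round small, while a full request remains available when the model has to be rebuilt from scratch.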

Key Takeaways

✓ The Denodo Platform makes all kinds of data – from a variety of data sources – readily available to your data analysts and data scientists

✓ Data virtualization shortens the ‘data wrangling’ phases of analytics/ML projects, avoiding the need to write ‘data prep’ scripts in Python, R, etc.

✓ It’s easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter

✓ The Denodo Platform enables centralization of data access and sharing across all data processing stages

Next Steps

Access Denodo Platform in the Cloud!

Take a Test Drive today!

GET STARTED TODAY

www.denodo.com/TestDrive

Q&A

Thanks!

www.denodo.com [email protected]

© Copyright Denodo Technologies. All rights reserved

Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.