Leveraging Data Virtualization
To maximize advanced analytics & machine learning potential
Vincent Fages-Gouyou,
Director of Product Management EMEA
Agenda
1. What are Advanced Analytics?
2. The Data Challenge
3. The Rise of Logical Data Architectures
4. Tackling the Data Pipeline Problem
5. Real-time Machine Learning with Data Virtualization
6. Key Takeaways
7. Q&A
8. Next Steps
Analytics Value Escalator
The Analytics Chasm
The Key Ingredient for Advanced Analytics is Data
Input data for a data science project may come in a variety of systems and formats:
• Files (CSV, logs, Parquet)
• Relational databases (EDW, operational systems)
• NoSQL systems (key-value pairs, document stores, time series, etc.)
• SaaS APIs (Salesforce, Marketo, ServiceNow, Facebook, Twitter, etc.)
In addition, the Big Data community has embraced data science as one of its pillars, for example through Spark and SparkML and through architectural patterns like the Data Lake.
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather requirements for the business problem
2. Identify relevant Data
3. Cleanse Data into a useful format
4. Analyze Data
5. Prepare input for algorithms
6. Execute Data Science algorithms (ML, AI, etc.)
Iterate 1 to 6
7. Visualize and Share
Typical Data Science Workflow
• Finding and Preparing the Data: 80%
• Analysis: 10%
• Visualizing data: 10%
Data Access – A Wild West Expedition?
Finding the right data
Getting access
Understand heterogeneous technologies (NoSQL, REST APIs, etc.)
Transforming into a usable format
Combining multiple sources
Profiling / sanitizing data
Prepare for ML / AI algorithms
Share data, processes and results
Photo by Jasper van der Meij on Unsplash
Data Lakes – The Solution?
High initial investment with unclear value
Replication, replication, replication… with limited value added
Data scientists alone can’t manage it efficiently, without additional support
Large processing capabilities
Agility
Photo by Aaron Burden on Unsplash
Lower cost of storage
Without governance & management, it can become a “Data Swamp”
Data Virtualization
Shared Data Infrastructures
Security, access control & audit
A unique Data Delivery platform for Data Science, Analytics and APIs
Maximize value out of your existing technologies (RDBMS, Hadoop, Cloud, etc.)
Optimized Investments
Shorten Time-to-Data
Photo by Tiago Gerken on Unsplash
Rise of Logical Architectures
The evolution of Analytical Architectures: Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, Gartner April 2018
Logical Data Warehouse: the Path to the Future
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”, Henry Cook, Gartner, April 2018
Data Scientist Workflow Steps
• Identify useful data
• Modify data into a useful format
• Analyze data
• Prepare for ML algorithm
• Execute data science algorithms (ML, AI, etc.)
• Share with business users
Security
Governance, Metadata Management, Data Mart
Data Access
Data Virtualization, Data Services
Agile Information Architecture
Visualization, ML / AI, Data Science, Data Quality
Data Sources: Data Warehouse, NoSQL, RDBMS
Federation, Transformation, Abstraction, Data Service
Dynamic Query Optimization: Cost Based Optimizer, Query Rewriting
Caching, MPP
Security & Governance
Lifecycle Management
Data Catalog: Discover, Collaborate, Query, Categorize
Gartner, Adopt the Logical Data Warehouse Architecture
to Meet Your Modern Analytical Needs, May 2018
“When designed properly, Data Virtualization can speed data
integration, lower data latency, offer flexibility and reuse, and
reduce data sprawl across dispersed data sources.
Due to its many benefits, Data Virtualization is often the first step
for organizations evolving a traditional, repository-style data
warehouse into a Logical Architecture”
Benefits of a Virtual Data Layer
A Virtual Layer improves decision making and shortens development cycles
• Surfaces all company data from multiple repositories without the need to replicate all data into a lake
• Eliminates data silos: allows for on-demand combination of data from multiple sources
A Virtual Layer broadens usage of data
• Improves governance and metadata management to avoid “data swamps”
• Decouples data source technology: access is normalized via SQL or web services
• Allows controlled access to the data with fine-grained security controls
A Virtual Layer offers performant access
• Leverages the processing power of the existing sources, controlled by Denodo’s optimizer
• Processes data for sources with no processing capabilities (e.g. files)
• Caching and ingestion engine to persist data when needed
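As a rough illustration of two of the points above (on-demand combination of data from multiple sources, with access normalized via SQL), the sketch below uses Python's built-in sqlite3 module as a stand-in for a virtual layer. This is an assumption made purely for illustration: it is not the Denodo API, and the databases, tables and values (`main.trips`, `wx.weather`) are hypothetical.

```python
import sqlite3

# Stand-in for a virtual layer: one connection fronting two separate databases.
# Both are in-memory SQLite databases here; in a real deployment they would be
# different systems (RDBMS, files, SaaS API, ...). All names are hypothetical.
conn = sqlite3.connect(":memory:")                # "source 1"
conn.execute("ATTACH DATABASE ':memory:' AS wx")  # "source 2"

conn.execute("CREATE TABLE main.trips (trip_date TEXT, trips INTEGER)")
conn.execute("CREATE TABLE wx.weather (obs_date TEXT, temp_c REAL, rain_mm REAL)")

conn.executemany(
    "INSERT INTO main.trips VALUES (?, ?)",
    [("2017-06-01", 61000), ("2017-06-02", 43000), ("2017-06-03", 58000)],
)
conn.executemany(
    "INSERT INTO wx.weather VALUES (?, ?, ?)",
    [("2017-06-01", 24.0, 0.0), ("2017-06-02", 18.5, 12.3), ("2017-06-03", 22.0, 0.0)],
)

# One SQL query combines both sources on demand; nothing is copied beforehand.
query = """
    SELECT t.trip_date, t.trips, w.rain_mm
    FROM main.trips AS t
    JOIN wx.weather AS w ON w.obs_date = t.trip_date
    WHERE w.rain_mm > 0
    ORDER BY t.trip_date
"""
rainy_days = conn.execute(query).fetchall()
print(rainy_days)
```

The consumer only sees SQL and one connection; which system actually holds each table is hidden behind the layer, which is the decoupling the slide describes.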
Demonstration
Accelerating the Machine Learning Data
Pipeline with Data Virtualization
https://flic.kr/p/x8HgrF
Can we predict the usage of the NYC bike
system based on data from previous years?
Data Sources – Citibike
https://flic.kr/p/CYT7SS
There are external factors to consider.
Which ones?
22
Data Sources – NWS Weather Data
What We’re Going To Do…
1. Connect to data and have a look
2. Format the data (prep it) so that we can look for significant factors
• e.g. bike trips on different days of week, different months of year, etc.
3. Once we’ve decided on the significant attributes, prepare that data for the ML algorithm
4. Using Python, read the 2017 data and run it through our ML algorithm for training
5. Read the 2018 data, test the algorithm
6. Save the results and load them into the Denodo Platform
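Steps 2 to 5 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the demo's actual code: the data is synthetic (generated by a hypothetical `synthetic_year` helper rather than read from Citibike or NWS), temperature is assumed to be the single attribute retained, and a one-variable least-squares line stands in for whatever ML algorithm the demo uses. Step 6 (loading results back into the platform) is left out.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def synthetic_year(year, seed):
    """Made-up daily rows standing in for the prepared trips + weather data."""
    rng = random.Random(seed)
    rows, day = [], date(year, 1, 1)
    while day.year == year:
        temp_c = rng.uniform(-5.0, 30.0)
        weekend = day.weekday() >= 5
        trips = 20000 + 1500 * temp_c - (4000 if weekend else 0) + rng.gauss(0, 1500)
        rows.append({"day": day, "temp_c": temp_c, "trips": trips})
        day += timedelta(days=1)
    return rows


def mean_trips_by_weekday(rows):
    """Step 2: average trips per weekday (0 = Monday ... 6 = Sunday)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        wd = row["day"].weekday()
        sums[wd] += row["trips"]
        counts[wd] += 1
    return {wd: sums[wd] / counts[wd] for wd in sorted(sums)}


def fit_line(rows):
    """Step 4: ordinary least squares for trips = intercept + slope * temp_c."""
    xs = [r["temp_c"] for r in rows]
    ys = [r["trips"] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def mean_abs_error(rows, predict):
    """Step 5: average absolute error of predict(row) against observed trips."""
    return sum(abs(predict(r) - r["trips"]) for r in rows) / len(rows)


train = synthetic_year(2017, seed=1)   # "2017 data" used for training
test = synthetic_year(2018, seed=2)    # "2018 data" used for testing

by_weekday = mean_trips_by_weekday(train)
intercept, slope = fit_line(train)
train_mean = sum(r["trips"] for r in train) / len(train)

mae_model = mean_abs_error(test, lambda r: intercept + slope * r["temp_c"])
mae_baseline = mean_abs_error(test, lambda r: train_mean)
print(f"slope={slope:.0f} trips per degree C")
print(f"MAE model={mae_model:.0f}, MAE baseline={mae_baseline:.0f}")
```

Comparing against a naive baseline (always predicting the 2017 average) is a quick check that the chosen attribute carries real signal before investing in a heavier model.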
A Modern Data Virtualization Architecture
DATA CONSUMERS
SQL Queries (JDBC, ODBC, ADO.NET)
Web Services (SOAP, REST, OData)
Web-based catalog & search
Secure delivery (SSL/TLS)
DATA VIRTUALIZATION
Execution Engine & Optimizer
Metadata Repository
MPP Processing
Relational Cache
Corporate Security
Monitoring & Auditing
DISPARATE DATA SOURCES
Demo
McCormick Spice
McCormick Spice (Cont’d)
Data Services (Data Virtualization)
API Management and Runtime
Semantics & Discovery
Governance
Security
System 1, System n, External API $
McCormick Spice (Cont’d)
Benefits
✓ Timely Information
✓ No replication
✓ No need to validate information
✓ Better staging for learning
Approach
1. Model requests specific modifications or full information
2. Model trains incrementally or fully
Diagram: Algorithms, Enterprise Data Services, Backend Systems (Real-Time), External Systems (Real-Time)
Flow: 1 Request, 2 Collect, 3 Receive, 4 Train
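The second step of the approach above (the model trains incrementally or fully) can be illustrated with a deliberately tiny model. The sketch below is an assumption for illustration only, not McCormick's implementation: the hypothetical `MeanModel` predicts a running mean, `fit` retrains on the full information, and `partial_fit` folds in one new observation at a time, as would happen when only specific modifications are requested.

```python
class MeanModel:
    """Toy model (hypothetical): predicts the mean of the values seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def fit(self, values):
        """Full training: recompute the model from the complete data set."""
        self.n = len(values)
        self.mean = sum(values) / self.n if values else 0.0
        return self

    def partial_fit(self, value):
        """Incremental training: update the model with a single new value."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        return self


history = [10.0, 12.0, 14.0]   # full information requested once
changes = [16.0, 18.0]         # specific modifications arriving later

full_model = MeanModel().fit(history + changes)

incremental_model = MeanModel().fit(history)
for value in changes:
    incremental_model.partial_fit(value)

print(full_model.mean, incremental_model.mean)
```

Both paths end at the same model state; the incremental path only needs the changed records, which is where timely delivery without replication helps.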
Key Takeaways
✓ The Denodo Platform makes all kinds of data – from a variety of data sources – readily available to your data analysts and data scientists
✓ Data virtualization shortens the ‘data wrangling’ phases of analytics/ML projects, avoiding the need to write ‘data prep’ scripts in Python, R, etc.
✓ It’s easy to access and analyze the data from analytics tools such as Zeppelin or Jupyter
✓ The Denodo Platform enables centralization of data access and sharing across all data processing stages
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
GET STARTED TODAY
www.denodo.com/TestDrive
Q&A
Thanks!
www.denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.