Hybrid Data Management, Integration & Analytics

Disclaimer

This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of Actian.

This document is not intended to be binding upon Actian to any particular course of business, pricing, product strategy, and/or development. Actian assumes no responsibility for errors or omissions in this document. Actian shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. Actian does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.


Actian Hybrid Data Conference 2018, London


Actian Vector with DataFlow: Using Machine Learning Algorithms for Business Analytics

Vidisha Sharma, Technical Support Engineer


What will be covered?

• What is the impact of Artificial Intelligence/Machine Learning on analytic databases?
• How do Actian Vector and DataFlow support AI/ML workloads?
• Real-world use

© 2018 Actian Corporation


What is the impact of Artificial Intelligence/Machine Learning on analytic databases?


What is Machine Learning?

Machine Learning: Mathematically intensive systems that learn a task from experience, with performance that improves as experience grows.

Traditional programming: A step-by-step procedure that uses a predefined algorithm to solve a specific problem.

[Figure: Traditional Programming vs. AI/Machine Learning]

Machine Learning is everywhere
• Recommender systems (Amazon, Netflix)
• Facebook tagging
• Email spam filtering
• Insurance domain
• Banking domain, and so on

Why is Machine Learning getting popular?
• Increase in computational speeds
• Very large volumes of data being generated
• High-dimensional data
• Faster improvement cycles compared to manual programming


What are the implications of AI/ML on the future of analytic databases?

Demographic transformation
✓ 90% of data was generated in the last 5 years.
✓ 1 terabyte of data 8 years ago corresponds to around 7 petabytes of data today.
✓ More and more data is generated by machines, such as the Internet of Things.
✓ Querying such large amounts of data will need new strategies.

Performance requisites
✓ To process large data, we need databases that offer parallelism, fast ingestion, and fast processing of data.
✓ Integrating data from different sources will become the main focus.

Change in client needs
✓ Change in data model, as the data already exists
✓ Physical -> Logical -> Conceptual
✓ Building new capabilities

Applying AI/ML in use cases
✓ Machine-generated data creates new use cases.
✓ Marketing for business can be done effectively.
✓ "What-if" questions can be asked and answered.

Why Actian?
✓ Equipped for the paradigm shift
✓ Actian Vector analytic database
✓ DataFlow


How do Vector and DataFlow support AI/ML workloads?


What is Vector and what can it do?

Vector
• A columnar, relational database designed for reporting and analytics
• Delivers extremely high performance, even on a single node
• Easy to install and use
• Runs on 64-bit Linux and Windows
• Excellent concurrency and real-time update characteristics

VectorH
• Scales from single-machine Vector to a cluster, leveraging the HDFS distributed filesystem and the YARN resource controller
• The result is a fully capable Vector DBMS that takes advantage of clustered hardware for massive performance gains


DataFlow

• Single platform for end-to-end data access, transformation, preparation, and predictive analysis

• Combines the KNIME (open source data mining platform) drag and drop visual workflow environment and the Actian DataFlow platform

• Eliminates memory constraints, as well as the need for data movement into specific data stores before analytics are run

• Executes on a desktop, a remote server, or clusters, including Hadoop clusters

• Transform, cleanse and analyze terabytes of data into actionable insights at record-breaking speed on commodity hardware


DataFlow Concepts

• Operators (nodes) linked together in a directed acyclic graph (DAG)

• Data flows along edges

• Shared nothing architecture

• Provides pipeline parallelism

• Supports data parallelism

• Data scalable
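To make these concepts concrete, here is a minimal Python sketch. It is not DataFlow code, and the operator names (`read_rows`, `square`, `keep_even`) and helpers are hypothetical. Generators stand in for operators linked in a small graph, so rows stream from one operator to the next without being materialized. This only mimics the streaming side of pipeline parallelism, since a real executor runs the operators concurrently. Data parallelism is sketched by running one copy of the graph per partition with no shared state. Python threads are used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator


def read_rows(values: Iterable[int]) -> Iterator[int]:
    """Source operator: emits rows one at a time."""
    for v in values:
        yield v


def square(rows: Iterable[int]) -> Iterator[int]:
    """Transform operator: handles each row as soon as upstream yields it."""
    for r in rows:
        yield r * r


def keep_even(rows: Iterable[int]) -> Iterator[int]:
    """Filter operator: passes only even values downstream."""
    for r in rows:
        if r % 2 == 0:
            yield r


def run_graph(partition: list[int]) -> int:
    """One copy of the graph: read -> square -> keep_even -> sum (sink)."""
    return sum(keep_even(square(read_rows(partition))))


def run_partitioned(data: list[int], workers: int) -> int:
    """Split rows into partitions, run one graph copy per partition with
    no shared state, then combine the partial results."""
    partitions = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_graph, partitions))


if __name__ == "__main__":
    data = list(range(1, 11))
    print(run_graph(data))           # 220
    print(run_partitioned(data, 3))  # 220, same answer from 3 partitions
```

Because each partition is processed independently, adding partitions (and workers) is the lever for scaling with data volume in this toy model.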


Vector and DataFlow for AI/ML Workloads

▪ Fast parallel data ingestion
▪ Access to analytic routines
▪ Parallel query execution through Vector
▪ Ability to support multiple higher-level interfaces, such as Spark, R, Scala, Python, and other advanced analytics tools
▪ Support for ANSI SQL, and for visualization/dashboard tools (such as Tableau, Looker, Qlik) based on that standard
▪ Quicker execution cycles for faster iteration
▪ Ability to build a workflow through the KNIME graphical user interface
▪ High speed due to the DataFlow executor
▪ DataFlow can run on both Hadoop and non-Hadoop clusters


Integrating Vector, DataFlow and ML


Use case at a glance

Problem statement
Evaluate a driver’s driving behavior, which leads to differential pricing of insurance premiums. Dynamic assessment helps in the approval of claims.

Dataset and expectations
• 50 million records, 24 variables
• Predict the risk zone of a driver
• Potential use of the risk zone profile in prescribing insurance premiums

Methodology
• Unsupervised learning (the k-means algorithm) was used to form homogeneous groups, separating the data into 3 clusters.
• Decision trees were used to derive key patterns, which were applied to the clusters to name them:
  - Cluster 1: Risk Zone
  - Cluster 2: Potential Risk Zone
  - Cluster 3: Safe Zone
• After the data was labeled, models were trained and tested using various algorithms. The best accuracy came from Logistic Regression.
• The Logistic Regression model is used to predict driving behavior.


Implementation steps

Ingest

Read data from CSV and copy to Vector

✓ Read all 50 million rows into Vector using DataFlow.
✓ Use the ‘Load Actian Vector on Hadoop Direct’ operator, which loads all the data directly into Vector.
✓ It took around 7 minutes to add the data to Vector.
✓ There are other ways to add this data to Vector, such as vwload or Director.
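The ingest step itself runs inside DataFlow, so the following is only a rough illustration of the general pattern of reading a CSV in bounded batches and copying it into a table. It uses Python's built-in `sqlite3` module as a stand-in database, not Vector. The `trips` table, its three columns, and the helper names are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice
from typing import Iterator, TextIO


def batches(reader: Iterator[list[str]], size: int) -> Iterator[list[list[str]]]:
    """Yield lists of up to `size` rows so memory use stays bounded."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


def load_csv(conn: sqlite3.Connection, source: TextIO, batch_size: int = 1000) -> int:
    """Copy a CSV stream with a header row into the hypothetical `trips` table."""
    reader = csv.reader(source)
    next(reader)  # skip the header row
    loaded = 0
    for chunk in batches(reader, batch_size):
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", chunk)
        conn.commit()  # one commit per batch, not per row
        loaded += len(chunk)
    return loaded


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in database, not Vector
    conn.execute("CREATE TABLE trips (driver_id INTEGER, speed REAL, braking REAL)")
    sample = io.StringIO("driver_id,speed,braking\n1,62.5,0.2\n2,88.0,0.9\n3,45.1,0.1\n")
    print(load_csv(conn, sample, batch_size=2))  # 3
```

Batching keeps memory flat regardless of file size. A dedicated bulk loader avoids the per-row insert overhead that this sketch still pays.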


Cluster and Label

Use k-means to label the data (converting an unsupervised problem into a supervised one)

✓ Read about 50% of the data to build the k-means clusters.
✓ Use the DataFlow ‘Type Conversion’ operator to change some variables to categorical variables.
✓ ‘Cluster Predictor’ assigns input data to the appropriate cluster.
✓ ‘Derive Fields’ helps assign an appropriate name to each cluster.
✓ Write the output to Vector.
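The cluster-then-name idea can be sketched in plain Python. This is a toy version of k-means (Lloyd's algorithm), not the DataFlow operators. The two features (average speed and hard-braking count), the sample values, and the rule that ranks clusters by the first feature are all assumptions for illustration. In the use case, the names were derived from decision-tree patterns instead.

```python
import math
from typing import Sequence

Point = Sequence[float]


def nearest(point: Point, centroids: list[list[float]]) -> int:
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points: list[Point], k: int, iterations: int = 20) -> list[list[float]]:
    """Lloyd's algorithm with a simple deterministic start:
    k evenly spaced points taken from the sorted input."""
    ordered = sorted(points)
    n = len(ordered)
    centroids = [list(ordered[round(i * (n - 1) / max(k - 1, 1))]) for i in range(k)]
    for _ in range(iterations):
        groups: list[list[Point]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster emptied out
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def name_clusters(centroids: list[list[float]]) -> dict[int, str]:
    """Hypothetical naming rule: rank clusters by their first feature."""
    names = ["Safe Zone", "Potential Risk Zone", "Risk Zone"]
    order = sorted(range(len(centroids)), key=lambda i: centroids[i][0])
    return {idx: names[rank] for rank, idx in enumerate(order)}


if __name__ == "__main__":
    # (average speed, hard-braking events): made-up sample rows
    trips = [(50, 1), (55, 2), (52, 1), (75, 5), (78, 6), (80, 5),
             (110, 12), (105, 11), (115, 13)]
    centroids = k_means(trips, k=3)
    zones = name_clusters(centroids)
    print(zones[nearest((108, 12), centroids)])  # Risk Zone
```

Here `nearest` plays the role of a cluster predictor (assigning a row to a cluster) and `name_clusters` plays the role of a derived field that turns a cluster index into a label.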


Train and Test

Use the labeled data to create a logistic regression model

✓ Read around 1,000,000 rows from the database and pass them to the Logistic Regression Learner.
✓ The Logistic Regression Predictor predicts a target value using a previously built logistic regression model.
✓ Time taken to build the model: 2 min 15 sec.
✓ The logistic classification model is written to a PMML file.
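The learner/predictor split can be illustrated with a small logistic regression written from scratch in Python. This is a sketch under simplifying assumptions, not the DataFlow operators. It handles only two classes (risky vs. not risky) rather than three zones, uses made-up scaled features, and saves the model as JSON as a stand-in for PMML.

```python
import json
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(rows: list[list[float]], labels: list[int],
          epochs: int = 2000, lr: float = 0.5) -> dict:
    """Learner: batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return {"weights": weights, "bias": bias}


def predict(model: dict, x: list[float]) -> float:
    """Predictor: probability that row `x` is in the positive class."""
    z = sum(w * xi for w, xi in zip(model["weights"], x)) + model["bias"]
    return sigmoid(z)


if __name__ == "__main__":
    # (scaled speed, scaled braking) with label 1 = risky: made-up rows
    rows = [[0.40, 0.05], [0.45, 0.10], [0.42, 0.08],
            [0.90, 0.80], [0.95, 0.90], [0.88, 0.85]]
    labels = [0, 0, 0, 1, 1, 1]
    saved = json.dumps(train(rows, labels))  # JSON stands in for PMML here
    model = json.loads(saved)
    print(predict(model, [0.92, 0.88]) > 0.5)  # True
```

Serializing the fitted model separates the expensive training step from scoring, so the saved model can later be applied to a much larger dataset without retraining.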


Classify

Logistic Regression can be used to classify the larger dataset

✓ Read the original 50 million rows.
✓ Use the PMML file built in stage 3 (Train and Test) for predictions.
✓ It took 1 min 44 sec to classify the remaining 25 million rows and write the classification to the database.
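For a rough sense of scale, the timings quoted for the ingest, train, and classify stages can be turned into rows per second. This is simple arithmetic on the figures reported in the slides. The ingest time is only given as "around 7 mins", so that rate is approximate.

```python
def rows_per_second(rows: int, minutes: int, seconds: int = 0) -> float:
    """Average throughput for a stage, from its row count and wall time."""
    return rows / (minutes * 60 + seconds)


STAGES = {
    "ingest (50M rows, ~7 min)": rows_per_second(50_000_000, 7),
    "train (1M rows, 2 min 15 sec)": rows_per_second(1_000_000, 2, 15),
    "classify (25M rows, 1 min 44 sec)": rows_per_second(25_000_000, 1, 44),
}

if __name__ == "__main__":
    for stage, rate in STAGES.items():
        print(f"{stage}: about {rate:,.0f} rows/sec")
```

On these figures, classification runs at roughly 240,000 rows per second and ingestion at roughly 119,000 rows per second, while model building is the slowest stage per row at about 7,400 rows per second.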


Quantitative comparison

Assessment of k-means on 50 million rows (4 GB of data) for various combinations

            Vector                    CSV
DataFlow    2 min 26 sec (93.06%)     8 min 13 sec (75.06%)
KNIME       10 min 7 sec (67.17%)     32 min 6 sec (base)

• k-means with R and CSV hangs, and after 19 minutes throws many errors.
• Imagine how much more advantage it would give as the data increases.
• Winning combination: using DataFlow and Vector gives a 93% improvement over KNIME and CSV.
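The improvement percentages can be checked from the listed run times, taking improvement as the share of the baseline (KNIME with CSV) run time that was saved. That definition is an assumption, since the slides do not state the formula. Recomputing from the rounded times gives values within about 1.5 percentage points of the slide's figures. The gap may come from more precise timings than the ones shown.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds


def improvement_pct(base_s: int, run_s: int) -> float:
    """Percentage of the baseline run time that was saved."""
    return 100.0 * (base_s - run_s) / base_s


BASE = to_seconds(32, 6)  # KNIME + CSV, the slide's baseline
RUNS = {
    "DataFlow + Vector": to_seconds(2, 26),
    "DataFlow + CSV": to_seconds(8, 13),
    "KNIME + Vector": to_seconds(10, 7),
}

if __name__ == "__main__":
    for name, secs in RUNS.items():
        print(f"{name}: {improvement_pct(BASE, secs):.2f}% less time, "
              f"{BASE / secs:.1f}x faster")
    # DataFlow + Vector: 92.42% less time, 13.2x faster
    # DataFlow + CSV: 74.40% less time, 3.9x faster
    # KNIME + Vector: 68.48% less time, 3.2x faster
```

Expressed as a ratio, the DataFlow and Vector combination finishes the same k-means job roughly 13 times faster than the baseline.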


Visualization


Conclusions

• Data is growing fast

• Sooner or later Machine Learning will be applied almost everywhere

• Tools with high speed and performance will become instrumental in making the right business decisions

• Vector and DataFlow are an ideal combination


Acknowledgements

Saurabh Mishra & J V Kameshwar Rao, Analytics CoE, ERS, HCL Technologies


Thank you!