Cloudera presentation at the Chief Analytics Officer, Fall 2016

© Cloudera, Inc. All rights reserved.

Data Engineering and Data Science: Modern Analytics and Data Processing for the Enterprise


Today, Data is Everything!

Instrumentation

Consumerization

Experimentation

Today, everything that can be measured will be measured.

Today, data IS the application.

Today, becoming data-driven is a business imperative.


“It will soon be technically feasible & affordable to record & store everything…”

— New York Times

“Digital technologies will, in the near future, accomplish many tasks once considered uniquely human.”

— Second Machine Age

Data is abundant, diverse & shared freely

As is how we store, process and analyze it

Streaming, Machine Learning, BI, ETL, Modeling


The new analytics paradigm

Understand why it happened

Change what happens next

Determine what happened

Make it happen consistently


Modern Data Engineering and Data Science require a new approach to handle more data, faster, with better access and a simplified architecture.


Apache Hadoop

Apache Hadoop is a platform for data storage and processing that is:

• Distributed
• Scalable
• Fault tolerant
• Open source

Flexibility

• A single repository for storing, processing & analyzing any type of data
• Not bound by a single schema
• On premises and in the cloud

Scalability + Complex Analysis

• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
• Real-time analytics

Low Cost

• Can be deployed on industry-standard hardware
• Open source platform guards against vendor lock-in
• 1-2 orders of magnitude less expensive than traditional systems

(Original) Core Hadoop Components

• Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
• YARN/MapReduce v2: Distributed Computing Across Physical Servers
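To make the MapReduce side of the core components concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word count. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across servers over data stored in HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["data is the application", "data is everything"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across nodes, which is the scale-out property the slide describes.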


End to End Lifecycle of Data Science

Stages: Data Engineering, Data Science, Production (Data Engineering / App Development)

Data Wrangling

Visualization and Analysis

Model Training & Testing

Production Model

Preparation Batch Scoring

Online Scoring

Serving

Dev Tools: IDEs/Notebooks, Collaboration

Ops Tools: Versioning, Scheduling, Workflow, Publishing

Data Governance Governance

Processing

Acquisition

Model Quality & Performance

Experiments
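The difference between the "Batch Scoring" and "Online Scoring" steps in the lifecycle can be sketched as follows. This is an illustration only: the logistic model, its feature names, and its weights are hypothetical and do not come from the presentation.

```python
import math
from typing import Dict, Iterable, List

# Hypothetical model parameters, for illustration only.
WEIGHTS: Dict[str, float] = {"visits": 0.8, "days_inactive": -0.5}
BIAS = -1.0


def score(record: Dict[str, float]) -> float:
    """Online scoring: return a probability for one record at request time."""
    z = BIAS + sum(w * record.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_batch(records: Iterable[Dict[str, float]]) -> List[float]:
    """Batch scoring: apply the same model to a whole stored dataset."""
    return [score(record) for record in records]


if __name__ == "__main__":
    print(score({"visits": 3.0, "days_inactive": 1.0}))
    print(score_batch([{"visits": 0.0}, {"visits": 5.0}]))
```

Both paths call the same `score` function, so a model trained once can serve scheduled jobs and live requests without diverging.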


Our Goal: Bring More Data Science Users to Hadoop

Data Scientist, Data Engineer

• Help more data scientists use the power of Hadoop
• Use a powerful, familiar environment with direct access to Hadoop data and compute

Enterprise Architect, Hadoop Admin

• Make it easy and secure to add new users, use cases
• Offer secure self-service analytics and a faster path to production on common, affordable infrastructure


Who is Data Engineering for?

Data Engineer/ETL Engineer

• Needs projects to scale
• Cares about performance
• Cares about SLAs
• Needs multitenancy, security, and optimized architecture

Data Scientist/Data Analyst

• Needs better scale
• Cares about access to data
• Wants better collaboration without managing dependencies

Analytics Leader

• Cares that the team is productive
• Cares about enforcing standards
• Wants results to share with the business


Requirements of a Data Science Platform

• Leverage Big Data – Volume, Variety, Velocity – to tackle various use cases

• Enable real-time use cases

• Provide sufficient toolset for the Data Analysts

• Provide sufficient toolset for the Data Scientists + Data Engineers

• Provide standard data governance capabilities

• Provide standard security across the stack

• Provide flexible deployment options

• Integrate with partner tools

• Provide management tools that make it easy for IT to deploy/maintain


Cloudera Enterprise, A New Way Forward


Data Engineering and Data Science Workloads

Data Ingestion (Kafka, Navigator, Search)

Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security.

Data Processing (Spark, Hive)

Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help take processing windows from hours to minutes and enables faster access to data for a variety of users and skillsets.

Data Science (Spark MLlib)

Cloudera is bringing the most popular data science languages and libraries to its platform for easier collaboration, self-service exploration, and implementation at scale. Cloudera is advancing the state of distributed machine learning, enabling exploratory data science and the delivery of robust data products.


Data Ingestion for Hadoop: Ingest Any Data Type at Any Rate

PROCESS, ANALYZE, SERVE

• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• SDK: Kite

UNIFIED SERVICES

• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService

STORE

• FILESYSTEM: HDFS
• RELATIONAL: Kudu
• NoSQL: HBase

INTEGRATE

• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume

Apache Sqoop: SQL to Hadoop

• Efficiently bulk load data (bidirectional)
• Easily get started with custom connectors freely available (RDBMS/EDW/NoSQL)

Apache Flume: Log Aggregation for Hadoop

• Efficiently move large amounts of streaming/log data
• Reliable, scalable, manageable, and extensible for production
• Connector ecosystem for common streaming data sources
• Easily gather logs from multiple systems

Apache Kafka: Pub-Sub Messaging for Hadoop

• Move data from many “producers” to many “consumers”
• Most flexible to support a wide range of use cases
• Integrates with Flume, HBase, Spark, etc.
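The many-producers-to-many-consumers idea behind pub-sub messaging can be sketched with an in-memory, append-only log. This is a toy model, not Kafka's client API: the `Topic` class and its methods are hypothetical, and partitions, persistence, replication, and consumer groups are all omitted.

```python
from collections import defaultdict
from typing import Dict, List


class Topic:
    """An append-only log that many producers write to and many consumers read."""

    def __init__(self) -> None:
        self._log: List[str] = []
        # Each consumer tracks its own read position (offset) in the log.
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    clicks = Topic()
    clicks.publish("page_view:/home")
    clicks.publish("page_view:/pricing")
    print(clicks.poll("dashboard"))     # both messages
    print(clicks.poll("archiver", 1))   # the first message only
```

Because messages stay in the log and each consumer keeps its own offset, a slow consumer does not block a fast one, and a new consumer can read from the beginning.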


Powerful Data Processing: The Most Apache Spark Experience


Spark: Data processing and data science for developers and data scientists

• Easy development
• Flexible, extensible API
• Fast batch and stream processing

Cloudera: Most experience with Spark on Hadoop for instant success

• First to ship and support
• Most Spark users trained
• Most customers running Spark
• Most engineering resources (committers, contributors, support)
• Only vendor focused on enterprise Spark


Data Science: A Unified Platform to Accelerate Data Science from Exploration to Production

Data Scientists need to use data to…

▪ Explore

▪ Model

▪ Test

The field of data science blends math and statistics knowledge with advanced computer knowledge.

▪ “Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician” (Josh Wills)


Spark MLlib: Collection of mainstream machine learning algorithms built on Spark

Including:

• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, Latent Dirichlet Allocation (LDA)
• Recommender Systems: Alternating Least Squares
• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
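As an illustration of one of the statistical functions listed, the Pearson correlation coefficient can be computed in a few lines of plain Python. This is a single-machine sketch of the statistic itself, not MLlib's implementation or API; the function name `pearson` is chosen here for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]))
```

A distributed library computes the same sums, but accumulates them per partition and combines the partial results, which is what lets the calculation scale across nodes.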


Logistic Regression Performance (Data Fits in Memory)

[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30), comparing MapReduce and Spark]

MapReduce: 110 s/iteration

Spark: first iteration = 80 s; further iterations 1 s due to caching
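Taking the slide's per-iteration figures at face value, the total running times can be worked out with a simple linear model. The assumption here, not stated in the presentation, is that each iteration costs a constant amount: 110 s for MapReduce, and 80 s for Spark's first iteration followed by 1 s for each cached iteration. The function names are hypothetical.

```python
def mapreduce_seconds(iterations: int, per_iteration: float = 110.0) -> float:
    """Total time when every iteration re-reads its input at the same cost."""
    return iterations * per_iteration


def spark_seconds(iterations: int, first: float = 80.0, cached: float = 1.0) -> float:
    """Total time when the first iteration loads data and later ones hit the cache."""
    if iterations <= 0:
        return 0.0
    return first + (iterations - 1) * cached


if __name__ == "__main__":
    # Same iteration counts as the chart's horizontal axis.
    for n in (1, 5, 10, 20, 30):
        print(n, mapreduce_seconds(n), spark_seconds(n))
```

Under these assumptions, 30 iterations take 3300 s on MapReduce and 109 s on Spark, roughly a 30x difference, and the gap widens with every additional iteration.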




Cloudera Data Science Workbench: A unified platform to accelerate data science from exploration to production.

1. Team Productivity Cloudera Workbench

2. Automation Cloudera Pipelines

3. Data Products Cloudera Models


Hadoop as a Data Science Platform

• Leverage Big Data: Hadoop

• Enable real-time use cases: Kafka, Spark Streaming, Kudu

• Provide sufficient toolset for the Data Analysts: Spark, Hive, Impala, Hue

• Provide sufficient toolset for the Data Scientists + Data Engineers: Cloudera Data Science Workbench

• Provide standard data governance capabilities: Navigator + Partners

• Provide standard security across the stack: Kerberos, Sentry, Record Service, KMS/KTS

• Provide flexible deployment options: Cloudera Director

• Integrate with partner tools: Rich Ecosystem

• Provide management tools that make it easy for IT to deploy/maintain: Cloudera Manager/Director


Three Core Enterprise Applications

OPERATIONS

DATA MANAGEMENT

UNIFIED SERVICES

PROCESS, ANALYZE, SERVE

STORE

INTEGRATE

Data Engineering & Science: Process data, develop & serve predictive models

Analytic Database: ELT, reporting, exploratory business intelligence

Operational Database: Build data-driven applications to deliver real-time insights


DATA-DRIVEN PRODUCTS

Delivering Improved Cash Flow to Healthcare Providers

• Streamlined transfer of messages between payers and providers

• Reduced cost per terabyte of storage by 90%

• Delivered data encryption and security protection for HIPAA compliance

HEALTHCARE » PRODUCT IMPROVEMENT » PREDICTIVE ANALYTICS » IT COST REDUCTION


• End-to-end view of data is helping save lives by detecting sepsis early enough for successful treatment

• Has saved 100s of lives already & reduced hospital readmissions

• Centralized data from many systems available in a secure environment

• 2PB+ in multi-tenant environment supporting 100s of clients

Improve Products & Services Efficiency


Thank you

[email protected]

linkedin.com/in/jordanvolz

Data & Analytics
