Top Trends in Building Data Lakes for Machine Learning and AI

Big Data in the CloudState of the Union and Future Trends

Ashish Thusoo Qubole CEO & Co-Founder

00Copyright 2017 © Qubole

About Me

Today - CEO & Co-founder of Qubole

➢ Big Data Platform built for the Cloud

• Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle)

• Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc)

• Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines

Background

➢ Started career at Oracle as a Principal Architect

➢ Ran Data Infrastructure team at Facebook from 2007-2011:

• Built out the Self-service Big Data Platform at Facebook for internal operations

• Saw huge growth, while chartered to provide all teams unified analytics (Marketing, Analytics,

Engineering, Sales Finance, etc.)

• Spawned developments of big data engines such as Apache Hive and Apache Presto

➢ Co-created and led Apache Hive project

From Data Warehouse to Data LakeEvolution of Big Data


Changing Nature of Data


Changing Nature of Analytics

Descriptive Analytics

Diagnostic Analytic

Predictive Analytics

Prescriptive Analytics

What

happened?

Why did it

happen?

What will

happen?

How can we

make it happen?

DIFFICULTY

VA

LU

E

Analytics Value Escalator


Breakdown of Data Warehouse – Emergence of Data Lake

ENTERPRISE DATA

WAREHOUSEETL

BIData

Mining

DATA LAKEINGEST

ELT MLData

Discovery

Data

Warehouse

LEGACY DATA PLATFORM MODEREN DATA PLATFORM


Differences Between Data Lakes and Data Warehouses

DATA LAKE vs DATA WAREHOUSE

Semi-structured / unstructured /

structured / raw DATA

Structured data

SQL / Machine Learning / ETL /

Graph Analytics etc. ANALYTICS FLEXIBILITY

SQL

Cheap storage for large volumes

of data VOLUME

Expensive at large volumes of

data

High agility with ability to quickly

reconfigure for new workloads AGILITY

Fixed configuration and limited

agility

Data Engineers / Data Scientists /

Analysts USERS

Analysts / Business Users


Data Back Office with a Data Warehouse Centric Approach

Marketing

OpsDeveloperSales

Ops

Product

Managers

Data Infrastructure

Data Team

Operations

Analyst

AnalystData

Architect

Business

Users

Product

Support

Customer

Support


Data Back Office Transformation with a Data Lake

Operations

Analyst

Marketing Ops

Analyst

Data

Architect

Business

Users

Product

SupportCustomer

Support

Developer

Sales Ops

Product

Managers

Data

Infrastructure

State of Data Lake Adoption in Enterprises


The Five Stages of Data Lake Maturity

01Stage

Aspiration

- Production

Reporting/DW

- Researching

02Stage

Experimentation

- Initial Big Data

Deployment

- Targeted Use

Case

03Stage

Expansion

- Multiple

Departments

- Multiple Engines

- Top Down Use

Cases

04Stage

Inversion

- Enterprise

Transformation

- Bottoms up

use cases

-

05Stage

Nirvana

- Digital

Enterprise

- Ubiquitous

Insights

- True Business

Transformation


Is there a plan in place

to move to a big data

self-service analytics

model?Yes 65%

No 35%

Data Lake’s Reality Gap: Everyone wants self-service nirvana

65% Moving to Self-Service Model to Enable Data Professionals


How confident are you

that the data team can

achieve self-service

analytics?

16% 39% 32% 11% 2%

0% 20% 40% 60% 80% 100%

Extremely confident

Very confident

Confident

Somewhat doubtful

Extremely doubtful

Data Lake’s Reality Gap: IT is confident they can get there

87% Confident They Can Provide Self-Service Analytics


Data Lake’s Reality Gap: Only 8% are there today

How do you assess

your big data

maturity?

In process but still have a lot to learn

35%

Need to scale and optimize our processes

28%

Still figuring it out 22%

Fully mature process providing tools and access to insights for all

stakeholders8%

We have proven processes but still

can’t meet business demand

7%

Only 8% Have Mature Big Data Processes


A Prescription to Success - Move to the Cloud

Adaptability• Best machine configuration for the workload

• Best Engine for the workload

• On demand and elastic; automatically scale up

or down

Agility• Initial provisioning in min/hours, not months

• Change configurations dynamically

• Compute and Storage scale independently

Cost• Pay only for what you actually use

• Use spot instances to reduce cost by up to 80%

Cloud Computing and Data LakesInfrastructure as a Service


Changing Nature of the IT Infrastructure

Mainframe Computing

Client-Server Computing

Cloud Computing

1960 1980 20002020


Cloud vs Data Centers

• Infrastructure is an API – Therefore Infrastructure can with the Application

App

STATIC & RIGIDFLEXIBLE & ELASTIC

App


Properties of Data Lakes

Big Data Analysis Is

Bursty

e.g. at Qubole we see

on an average the

minimum to maximum

size of infrastructure

to vary by 3400%

Ever

Expanding

e.g. data

processed on

Qubole as grown

2.5x in a year

Rapidly

Evolving

e.g. Spark,

Presto and

others

addressing gaps

in technology


Creating Data Lakes in the Cloud

+

1. Faster Time to Production (Minutes and Hours instead of 6-18 months)

2. Increasing success in Big Data and Data Science projects

3. Agility and Flexibility of Infrastructure to change compute for workloads and

technology

4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)


Key Architecture – Separation of Compute and Storage

COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE

COMPUTE AND STORAGE COMPUTE AND STORAGE COMPUTE AND STORAGE

OBJECT STORAGE

ELASTIC COMPUTE

LEGACY DATA CENTER ARCHITECTURE CLOUD INFRASTRUCTURE ARCHITECTURE


Architecture – Putting Together Cloud Data Lakes


Cloud Data Lakes vs Data Center Data Lakes - Automation

Cluster Lifecycle Management

Auto start/terminate

Auto-scaling up/down

Performance Optimization

Cluster re-balancing

Performance/Caching

Cost OptimizationSpot node usageResource substitution


Cloud Data Lakes vs Data Center Data Lakes - TCO

25

5

10

15

20

Clu

ste

r siz

e (

No

de

s)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Time

Minimum nodes

Maximum nodes

On Prem Cluster (fixed)

Actual Compute Usage

Cluster Lifecycle Savings (19% / 90%)

Workload Aware Auto-Scaling Savings (48% / 90%)

Spot Shopper Savings (15% / 45%)


Cloud Data Lakes vs Data Center Data Lakes – Concurrency and Elasticity

select t.county, count(1) from (select transform(a.zip) using ‘geo.py’ as a.countyfrom SMALL_TABLE a) tgroup by t.county;

insert overwrite table dest

select a.id, a.zip, count(distinct b.uid)

from ads a join LARGE_TABLE b on

(a.id=b.ad_id) group by a.id, a.zip;OBJECT STORAGE

ELASTIC COMPUTE


Bringing it Together

Using the Cloud for Big Data Platforms and Data Science Operations is

Fundamentally Different from Operating Big Data Platforms On-Premise.

Done Properly This Leads to

1. Faster Time to Value

2. Increased Flexibility and Scale

3. Better Adoption of Analytics

4. Ability to Spend Smarter, and Not Less

Example Cases of Success


Case Studies

Reduced Time to Market for New Products - Ibotta, Inc.

➔ Ibotta was faced with the challenges and limitations of a cloud Data Warehouse, reaching constraints

of being able to scale and leaving new data on the floor, which was prevent new features such as

personalized app experiences and campaign optimizations. This drove building a datalake, being the

source of ETL pipelines from ingestion, along with enabling Data Science and Product teams for

feature engineering.

Bursty ETL on Telemetry & User Data -

➔ Lyft crunches through all the data that an application that their drivers and riders create. Lyft

generates 3-5x more data on weekends vs. weekdays. Their data platform has to be flexible enough

to process a significant amount more on weekends in order to deliver the right insights back to their

users.


Case Studies

Data Governance for Customer 360 -

➔ Turner had a initiative for giving Engineering, Data Science and Analyst breaking down their data

silos in order to provide a 360 view of their engagement from campaigns through viewership. Using

QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing

sensitive data to the wrong teams, while enabling them the right access for their tools.

Data Science Development Workbench -

➔ Acxiom had a need to share data across different data science teams with the control to lock down

certain users. With Qubole, they have built a “Data Science Playground” that allows various DS

teams have a single environment for developing and testing models and algorithms before rolling

them into production.


Right Tool for the Right Team

Built for Anyone Who Uses Data

Analysts

Data Scientists

Data Engineers

Data Admins

BI teams

Executive Dashboards

Products

A Single Platform for Any Use Case

Reporting

Ad Hoc Queries

Machine Learning

Stream Processing

Vertical Apps

ETL Workflows

Open Source Engines, Optimized for the Cloud

Cloud-Native, Optimized for Agnostics

Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases)

Workflows orchestration (Airflow, Oozie, Azkaban)

Future Directions


(Re)Emergence of Deep Learning


Deep Learning Applications

Applications today focused in areas around

• Image Recognition and Processing

• Speech Recognition and Processing

• NLP and Text Analysis


Emergence of New Use Cases and Technology

Deep Learning Platforms are Emerging


Serverless Computing


Advantages of Serverless

• Zero Administration

• Fast Bursting

• Cost Advantages


Emergence of Serverless Computing

Let’s Chat…

[email protected]

Learn More at www.qubole.com

@qubole

Data & Analytics

Top Trends in Building Data Lakes for Machine Learning and AI