Big Data in the Cloud: State of the Union and Future Trends
Ashish Thusoo, Qubole CEO & Co-Founder
Copyright 2017 © Qubole
About Me
Today - CEO & Co-founder of Qubole
➢ Big Data Platform built for the Cloud
• Scaling to process 700+ PB of data a month in popular Clouds (AWS, Microsoft Azure, Oracle)
• Provides Auto-Orchestration and Self-Service for big data jobs (e.g. ETL, Machine Learning, Adhoc)
• Cloud and workload optimized Spark, Hive, Hadoop and Presto Engines
Background
➢ Started career at Oracle as a Principal Architect
➢ Ran Data Infrastructure team at Facebook from 2007-2011:
• Built out the Self-service Big Data Platform at Facebook for internal operations
• Saw huge growth, while chartered to provide unified analytics to all teams (Marketing, Analytics, Engineering, Sales, Finance, etc.)
• Spawned development of big data engines such as Apache Hive and Presto
➢ Co-created and led Apache Hive project
From Data Warehouse to Data Lake: Evolution of Big Data
Changing Nature of Data
Changing Nature of Analytics
Analytics Value Escalator (VALUE rises with DIFFICULTY):
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Breakdown of Data Warehouse – Emergence of Data Lake
LEGACY DATA PLATFORM: ETL, Enterprise Data Warehouse, BI, Data Mining
MODERN DATA PLATFORM: Ingest, Data Lake, ELT, ML, Data Discovery, Data Warehouse
Differences Between Data Lakes and Data Warehouses
DATA LAKE vs DATA WAREHOUSE
• DATA: Data lake holds semi-structured / unstructured / structured / raw data; data warehouse holds structured data
• ANALYTICS FLEXIBILITY: Data lake supports SQL / Machine Learning / ETL / Graph Analytics etc.; data warehouse supports SQL
• VOLUME: Data lake offers cheap storage for large volumes of data; data warehouse is expensive at large volumes of data
• AGILITY: Data lake has high agility with the ability to quickly reconfigure for new workloads; data warehouse has fixed configuration and limited agility
• USERS: Data lake serves Data Engineers / Data Scientists / Analysts; data warehouse serves Analysts / Business Users
Data Back Office with a Data Warehouse Centric Approach
[Figure labels: Marketing Ops, Developer, Sales Ops, Product Managers, Operations Analyst, Analyst, Data Architect, Business Users, Product Support, Customer Support; Data Team; Data Infrastructure]
Data Back Office Transformation with a Data Lake
[Figure labels: Operations Analyst, Marketing Ops, Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers; Data Infrastructure]
State of Data Lake Adoption in Enterprises
The Five Stages of Data Lake Maturity
Stage 01: Aspiration
- Production Reporting/DW
- Researching
Stage 02: Experimentation
- Initial Big Data Deployment
- Targeted Use Case
Stage 03: Expansion
- Multiple Departments
- Multiple Engines
- Top Down Use Cases
Stage 04: Inversion
- Enterprise Transformation
- Bottoms up use cases
Stage 05: Nirvana
- Digital Enterprise
- Ubiquitous Insights
- True Business Transformation
Data Lake’s Reality Gap: Everyone wants self-service nirvana
Is there a plan in place to move to a big data self-service analytics model?
• Yes: 65%
• No: 35%
65% Moving to Self-Service Model to Enable Data Professionals
Data Lake’s Reality Gap: IT is confident they can get there
How confident are you that the data team can achieve self-service analytics?
• Extremely confident: 16%
• Very confident: 39%
• Confident: 32%
• Somewhat doubtful: 11%
• Extremely doubtful: 2%
87% Confident They Can Provide Self-Service Analytics
Data Lake’s Reality Gap: Only 8% are there today
How do you assess your big data maturity?
• In process but still have a lot to learn: 35%
• Need to scale and optimize our processes: 28%
• Still figuring it out: 22%
• Fully mature process providing tools and access to insights for all stakeholders: 8%
• We have proven processes but still can’t meet business demand: 7%
Only 8% Have Mature Big Data Processes
A Prescription to Success - Move to the Cloud
Adaptability
• Best machine configuration for the workload
• Best engine for the workload
• On demand and elastic; automatically scale up or down
Agility
• Initial provisioning in minutes/hours, not months
• Change configurations dynamically
• Compute and storage scale independently
Cost
• Pay only for what you actually use
• Use spot instances to reduce cost by up to 80%
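The cost bullets above can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name, the hourly rate and the spot fraction are hypothetical, and the 80% discount is the "up to" figure from the slide, not a guaranteed price.

```python
def monthly_cost(node_hours: float, on_demand_rate: float,
                 spot_fraction: float = 0.0, spot_discount: float = 0.0) -> float:
    """Cost of the node-hours actually used, with part of them on spot nodes.

    spot_fraction: share of node-hours served by spot nodes (0 to 1).
    spot_discount: discount of the spot price versus on-demand (0 to 1).
    All inputs are illustrative assumptions, not real cloud prices.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    on_demand_hours = node_hours * (1.0 - spot_fraction)
    spot_hours = node_hours * spot_fraction
    spot_rate = on_demand_rate * (1.0 - spot_discount)
    return on_demand_hours * on_demand_rate + spot_hours * spot_rate


# Hypothetical month: 1,000 node-hours at 0.50 per node-hour on demand.
all_on_demand = monthly_cost(1000, 0.50)
half_on_spot = monthly_cost(1000, 0.50, spot_fraction=0.5, spot_discount=0.8)
print(f"on-demand only: {all_on_demand:.2f}, half on spot: {half_on_spot:.2f}")
```

Only the node-hours actually consumed enter the formula, which is the "pay only for what you actually use" point; the spot terms then lower the rate on part of those hours.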
Cloud Computing and Data Lakes: Infrastructure as a Service
Changing Nature of the IT Infrastructure
Timeline (1960 to 2020): Mainframe Computing, Client-Server Computing, Cloud Computing
Cloud vs Data Centers
• Infrastructure is an API – therefore infrastructure can change with the application
• Data center: App on STATIC & RIGID infrastructure
• Cloud: App on FLEXIBLE & ELASTIC infrastructure
Properties of Data Lakes
Big Data Analysis Is
• Bursty: e.g. at Qubole we see on average the minimum to maximum size of infrastructure vary by 3400%
• Ever Expanding: e.g. data processed on Qubole has grown 2.5x in a year
• Rapidly Evolving: e.g. Spark, Presto and others addressing gaps in technology
Creating Data Lakes in the Cloud
1. Faster Time to Production (Minutes and Hours instead of 6-18 months)
2. Increasing success in Big Data and Data Science projects
3. Agility and Flexibility of Infrastructure to change compute for workloads and technology
4. Significant Cost Reduction (Infrastructure fits avg utilization as opposed to peak)
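Point 4 above (infrastructure sized for average rather than peak utilization) can be sketched as a node-hour comparison. The hourly demand figures and function names below are hypothetical and chosen only to show the shape of the calculation.

```python
def node_hours_fixed(hourly_demand: list[int]) -> int:
    """Node-hours for a fixed cluster sized for the busiest hour."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")
    return max(hourly_demand) * len(hourly_demand)


def node_hours_elastic(hourly_demand: list[int], min_nodes: int = 0) -> int:
    """Node-hours for a cluster that follows demand but never drops below min_nodes."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")
    return sum(max(demand, min_nodes) for demand in hourly_demand)


def savings_fraction(hourly_demand: list[int], min_nodes: int = 0) -> float:
    """Share of node-hours saved by following demand instead of provisioning for peak."""
    fixed = node_hours_fixed(hourly_demand)
    elastic = node_hours_elastic(hourly_demand, min_nodes)
    return 1.0 - elastic / fixed


# Hypothetical bursty workload: nodes needed in each of six hours.
demand = [2, 2, 4, 20, 8, 2]
print(node_hours_fixed(demand), node_hours_elastic(demand, min_nodes=2))
print(f"{savings_fraction(demand, min_nodes=2):.0%} fewer node-hours")
```

The burstier the workload, the larger the gap between the two totals, which is why peak-sized on-premise clusters sit idle for much of the day.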
Key Architecture – Separation of Compute and Storage
LEGACY DATA CENTER ARCHITECTURE: every node combines COMPUTE AND STORAGE
CLOUD INFRASTRUCTURE ARCHITECTURE: ELASTIC COMPUTE kept separate from OBJECT STORAGE
Architecture – Putting Together Cloud Data Lakes
Cloud Data Lakes vs Data Center Data Lakes - Automation
Cluster Lifecycle Management
• Auto start/terminate
• Auto-scaling up/down
Performance Optimization
• Cluster re-balancing
• Performance/Caching
Cost Optimization
• Spot node usage
• Resource substitution
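The lifecycle and auto-scaling items above can be pictured as two small decisions: how many nodes the current backlog calls for, and whether an idle cluster should be shut down. This is a generic sketch, not Qubole's algorithm; every name, parameter and threshold here is a hypothetical choice for illustration.

```python
import math


def target_nodes(pending_tasks: int, tasks_per_node: int,
                 min_nodes: int, max_nodes: int) -> int:
    """Nodes needed for the current backlog, clamped to the configured bounds."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def should_terminate(idle_minutes: float, idle_timeout_minutes: float) -> bool:
    """Auto-terminate once the cluster has been idle for the configured timeout."""
    return idle_minutes >= idle_timeout_minutes


# Hypothetical cluster: each node handles 10 tasks, bounds of 2 to 25 nodes.
for backlog in (0, 95, 1000):
    print(backlog, "pending tasks ->", target_nodes(backlog, 10, 2, 25), "nodes")
```

Clamping to a minimum and maximum keeps the cluster responsive at low load and caps spend at high load; the idle timeout covers the auto start/terminate item.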
Cloud Data Lakes vs Data Center Data Lakes - TCO
[Chart: Cluster size (Nodes, 5 to 25) over Time (1 to 24), showing Minimum nodes, Maximum nodes, On Prem Cluster (fixed) and Actual Compute Usage]
• Cluster Lifecycle Savings (19% / 90%)
• Workload Aware Auto-Scaling Savings (48% / 90%)
• Spot Shopper Savings (15% / 45%)
Cloud Data Lakes vs Data Center Data Lakes – Concurrency and Elasticity
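-- Small job: map each zip code to a county with a custom script, then count rows per county
select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t
group by t.county;

-- Large job: join ads against a large table and count distinct users per ad and zip
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

ELASTIC COMPUTE
OBJECT STORAGE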
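The first query on this slide streams rows through a script named geo.py, whose contents are not shown. Hive's TRANSFORM passes each input row to the script on standard input as tab-separated text and reads output rows back from standard output. The sketch below shows what such a script could look like; the lookup table and helper names are hypothetical stand-ins, and a real script would load a full zip-to-county dataset.

```python
import sys

# Hypothetical lookup table for illustration only.
ZIP_TO_COUNTY = {
    "94105": "San Francisco",
    "10001": "New York",
}


def county_for(zip_code: str) -> str:
    """Return the county for a zip code, or UNKNOWN when it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines):
    """Yield one county per input row; the zip code is the first tab-separated field."""
    for line in lines:
        zip_code = line.rstrip("\n").split("\t")[0]
        yield county_for(zip_code)


if __name__ == "__main__" and not sys.stdin.isatty():
    # When run by Hive, rows arrive on stdin and results go to stdout.
    for county in transform(sys.stdin):
        print(county)
```

Keeping the row logic in `transform` makes the script easy to test locally with a plain list of lines before shipping it to a cluster.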
Bringing it Together
Using the Cloud for Big Data Platforms and Data Science Operations is
Fundamentally Different from Operating Big Data Platforms On-Premise.
Done Properly This Leads to
1. Faster Time to Value
2. Increased Flexibility and Scale
3. Better Adoption of Analytics
4. Ability to Spend Smarter, and Not Less
Example Cases of Success
Case Studies
Reduced Time to Market for New Products - Ibotta, Inc.
➔ Ibotta ran into the scaling limits of a cloud data warehouse and was leaving new data on the floor, which prevented new features such as personalized app experiences and campaign optimizations. This drove the team to build a data lake that serves as the source for ETL pipelines from ingestion onward and enables the Data Science and Product teams to do feature engineering.
Bursty ETL on Telemetry & User Data - Lyft
➔ Lyft crunches through all the data that its drivers and riders create in the application. Lyft generates 3-5x more data on weekends than on weekdays, so its data platform has to be flexible enough to process significantly more data on weekends in order to deliver the right insights back to its users.
Case Studies
Data Governance for Customer 360 - Turner
➔ Turner had an initiative to break down data silos for its Engineering, Data Science and Analyst teams in order to provide a 360 view of engagement from campaigns through viewership. Using QDS, they could easily set up governance and permissions in S3, eliminating the risk of exposing sensitive data to the wrong teams while giving each team the right access for its tools.
Data Science Development Workbench - Acxiom
➔ Acxiom needed to share data across different data science teams while keeping the control to lock down certain users. With Qubole, they have built a “Data Science Playground” that gives the various DS teams a single environment for developing and testing models and algorithms before rolling them into production.
Right Tool for the Right Team
Built for Anyone Who Uses Data
Analysts
Data Scientists
Data Engineers
Data Admins
BI teams
Executive Dashboards
Products
A Single Platform for Any Use Case
Reporting
Ad Hoc Queries
Machine Learning
Stream Processing
Vertical Apps
ETL Workflows
Open Source Engines, Optimized for the Cloud
Cloud-Native, Optimized for Agnostics
Web App User Interface, SDK (Python, Java, R, Ruby APIs), ODBC/JDBC connectors (visualization software and databases)
Workflow orchestration (Airflow, Oozie, Azkaban)
Future Directions
(Re)Emergence of Deep Learning
Deep Learning Applications
Applications today focused in areas around
• Image Recognition and Processing
• Speech Recognition and Processing
• NLP and Text Analysis
Emergence of New Use Cases and Technology
Deep Learning Platforms are Emerging
Serverless Computing
Advantages of Serverless
• Zero Administration
• Fast Bursting
• Cost Advantages
Emergence of Serverless Computing