25
Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President Engineering SVDS May 4, 2017 Brent Compton Sr. Director Storage Solution Architectures Red Hat

Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Big Data Analytics with Silicon Valley Data Science and Red Hat

Stephen O’SullivanVice President Engineering SVDS

May 4, 2017

Brent ComptonSr. Director Storage Solution ArchitecturesRed Hat

Page 2: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Key Takeaways

● Disaggregating compute from storage provides flexibility

● Many-to-one: multiple analytics clusters to one object store

● On-demand ephemeral compute enables speed to capability

Page 3: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

A BIG DATA PLATFORM PARABLE

Page 4: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

R&D / HOBBYTHIS HADOOP THING IS COOL

● A few servers “under a desk”

● Kicking tires with very small amounts of data

● Learning the new tool sets

Data Size: 100GB – 500GB Cost: $500-$3K, but no license cost

Page 5: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

PROOF OF CONCEPTMAYBE THIS WILL HELP OUR BUSINESS

● “Steel thread” use case to prove platform capabilities

● A few non-prod servers

● Only storing data just for the use case, not a full dataset

● Processing servers where data is stored

Data Size: 1TB – 5TB Cost: $5K-$20K, still no license cost

Page 6: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

INITIAL PRODUCTIONYES, WE CAN USE THIS

● Initial production Hadoop cluster

● Dedicated servers, network infrastructure

● Some operational support

● Full dataset volume and ingesting new data

Data Size: 50TB – 100TB Cost: ~$500K, plus licenses at $4K-$8K per node … and growing

Page 7: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

VICTORY!… OR JUST THE FIRST HURDLE

You’re successful! But…

● More and more data

● Different workloads from different people

● Need to be able to explore, develop, and execute

● As you add nodes for the increased data, different consumers, and new workloads, server, licences, and support costs grow ...

Page 8: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Do you need to scale all that compute ... or is it just storage?

Page 9: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

WHAT ARE WE SEEING NOW?

Page 10: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Object Storage

Processing Clusters

Reliable, Fast Network Infrastructure

DISAGGREGATED COMPUTE AND STORAGE

● Independently scale compute and storage

● Control on licensing costs

● Flexibility for different datacenters / cloud regions or providers

● Ephemeral data labs for exploratory data science work

● No dependency on Hadoop for data access

Page 11: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

WHAT ARE WE SEEING NOW

● Multiple analytics clusters on unified object storage

● Supports production, data science, and data warehouse activities

● Read-only copies of data with lineage

● Ephemeral on-demand environments

Page 12: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

WHAT ARE WE SEEING NOW

Unified object storage gives enterprises flexibility, control, and scalability across workloads

Page 13: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

WHAT’S POSSIBLE—Large European grocery company

• Board-level imperative to:

○ Understand customers better ○ Run stores more effectively

BEFORE

RESULTSSOLUTION• Unified platform with full metadata and data

lineage

• Data consumable by BI and Data Science users using Hadoop, Spark, SAS, Data Warehouse

CHALLENGE• Not all data captured, or being deleted quickly

• Data silos: no central view, lots of different technology, little communication

• No waiting for months to get access

• Company-wide data catalog of all data assets

• Effective cost control means ability to store data for longer periods of time

Page 14: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

WHAT’S POSSIBLE—Large Global Retailer

• Group had different data silos in Oracle

• Already over-subscribed for on-premise Hadoop cluster with a limited data set

• Held hostage by third-party data sources

BEFORE

RESULTSSOLUTION• Moved to cloud with all data persisted in object

store

• Analytics Hadoop clusters and toolkits

• Configurable framework for Load + Transform into different end user reporting clusters

CHALLENGE• Certain customer data sources were being

captured but were difficult to use

• Different teams needed access to data in one place

• Frictionless environment/tools to move data science insights to end user reporting

• Can spin up additional compute environments faster, with data from the object store

Page 15: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

INSTANTIATINGEMERING PATTERNSWITH RED HAT TECHNOLOGY

Page 16: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

COMPUTE

HDFS TMP

COMPUTE

HDFS TMP

COMPUTE

HDFS TMP

Emerging PatternsMultiple analytic clusters, provisioned on-demand, sourcing from a common object store

STORAGE

ANALYTIC CLUSTER 1

COMPUTE

STORAGE

COMPUTE

STORAGE

COMPUTE

STORAGE

WORKER

HADOOP CLUSTER 1 ANALYTIC CLUSTER 2 ANALYTIC CLUSTER 3

COMMON PERSISTENT STORE

2

3

Page 17: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Addressing Cost Inefficiency at ScaleAdding storage capacity frequently means adding idle compute capacity too

COMPUTE

STORAGE

WORKER

TRADITIONAL HADOOP CLUSTER

COMPUTE

STORAGE

WORKER

COMPUTE

STORAGE

WORKER

COMPUTE

STORAGE

WORKER

COMPUTE

STORAGE

WORKER

idle idle. . .

Page 18: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

COMPUTE

HDFS TMP

COMPUTE

HDFS TMP

COMPUTE

HDFS TMP

Addressing Cost Efficiency at ScaleAdding more storage should require the cost of … adding more storage

RED HAT CEPH STORAGE

ANALYTIC CLUSTER 1 ANALYTIC CLUSTER 2 ANALYTIC CLUSTER 3

COMMON PERSISTENT STORE

ADDITIONAL CEPH STORAGE

Page 19: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Addressing Agility Inefficiency at ScaleNew analytics projects frequently require spinning up separate Hadoop/Spark clusters

2

3

COMPUTE

STORAGE

COMPUTE

STORAGE

COMPUTE

STORAGE

WORKER

HADOOP CLUSTER 1

COMPUTE

STORAGE

COMPUTE

STORAGE

COMPUTE

STORAGE

WORKER

HADOOP CLUSTER 1

Page 20: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

HADOOP

HDFS TMP

SPARK

HDFS TMP

PRESTO

HDFS TMP

Addressing Agility Efficiency at ScaleData teams need on-demand provisioning of analytic clusters right-sized for the job

RED HAT CEPH STORAGE

ANALYTIC CLUSTER 1 ANALYTIC CLUSTER 2 ANALYTIC CLUSTER 3

COMMON PERSISTENT STORE

RED HAT OPENSTACK PLATFORM on-demand provisioning

Page 21: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

Addressing Cost Inefficiency at ScaleMultiple Hadoop/Spark clusters frequently means buying storage for full datasets on each cluster

2

3

COMPUTE

STORAGE

COMPUTE

STORAGE

COMPUTE

STORAGE

WORKER

HADOOP CLUSTER 1

COMPUTE

STORAGE

COMPUTE

STORAGE

COMPUTE

STORAGE

WORKER

HADOOP CLUSTER 1

Page 22: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

HADOOP

HDFS TMP

SPARK

HDFS TMP

PRESTO

Addressing Cost Efficiency at ScaleAdding more analytic clusters doesn’t require storing duplicate copies of datasets

RED HAT CEPH STORAGE

ANALYTIC CLUSTER 1 ANALYTIC CLUSTER 2 ANALYTIC CLUSTER 3

COMMON PERSISTENT STORE

Page 23: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

PRESTO

Lab Validation and Benchmarking Underway

RED HAT CEPH STORAGE

ANALYTIC CLUSTER 1 ANALYTIC CLUSTER 2

COMMON PERSISTENT STORE

S3A FILESYSTEM ADAPTOR

KAFKA

SECOR

INGEST

TPC-DS BENCHMARK SUITE(adapted for scale by Silicon Valley Data Science)

RANDOM OBJECTDATASET

GENERATION(SVDS adaptation)

BIG DATALAB PARTNER

HADOOP SPARK

QUERY ENGINES (HIVE, SPARK SQL)

HDFS TMP

Page 24: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President

THANK YOUplus.google.com/+RedHat

linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHatNews

Page 25: Data Science and Red Hat Big Data Analytics with Silicon Valley · 2018-02-06 · Big Data Analytics with Silicon Valley Data Science and Red Hat Stephen O’Sullivan Vice President