Ido Friedman Data Lake From Bare metal to the clouds

Data lake – On Premise VS Cloud


Page 1: Data lake – On Premise VS Cloud

Ido Friedman

Data Lake From Bare metal to the clouds

Page 2: Data lake – On Premise VS Cloud

IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team Leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, CouchBase, MongoDB, Python, Hadoop, SQL, and more …]
WorkPlace: Perion
WhenNotWorking: @Sea

Page 3: Data lake – On Premise VS Cloud

Data lake

The idea of a data lake is to have a single store for all of the enterprise's data, ranging from raw data (an exact copy of the source system data) to transformed data used for tasks such as reporting, visualization, analytics and machine learning.

Page 4: Data lake – On Premise VS Cloud

Goals????

Page 5: Data lake – On Premise VS Cloud

Raw

Page 6: Data lake – On Premise VS Cloud

Raw data
What is it?
Who needs it?
How can we access it?
What can I get from it?
How long do you keep it?

Page 7: Data lake – On Premise VS Cloud

Traditional tools of the trade

SQL

Page 8: Data lake – On Premise VS Cloud

What changed?
Everything
Hadoop development started in 2006 (version 1.0 was released in 2011), based on a Google white paper from 2004

Page 9: Data lake – On Premise VS Cloud

Data locality!
What changed?

2013, 2015

Page 10: Data lake – On Premise VS Cloud

Data locality! What does it mean? What changed?

Google Cloud Storage Connector for Spark and Hadoop

X3 = 24K$

Page 11: Data lake – On Premise VS Cloud

SQL Won!
What changed?

Here to stay

Page 12: Data lake – On Premise VS Cloud

The consumers
What changed?

Page 13: Data lake – On Premise VS Cloud

Cloud storage cost

[Chart: price in $/GB over time, from 3/14/2006 to 1/12/2016; vertical axis from $0.00 to $0.16]

Google BigQuery cuts historical data storage cost in half and accelerates many queries by 10x

From $0.15 per GB to $0.015 per GB

A 90% drop
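
The 90% figure follows directly from the two prices quoted above. A minimal sketch of the arithmetic; the 100 TB lake size is a hypothetical number chosen only for illustration, and it is taken as 100,000 GB for simplicity:

```python
def percent_drop(old_price: float, new_price: float) -> float:
    """Return the relative price decrease as a percentage."""
    return (old_price - new_price) / old_price * 100


old_per_gb = 0.15   # $/GB, from the slide
new_per_gb = 0.015  # $/GB, from the slide

drop = percent_drop(old_per_gb, new_per_gb)
print(f"Price drop: {drop:.0f}%")

# Hypothetical lake size, not from the slides: 100 TB taken as 100,000 GB.
lake_size_gb = 100_000
cost_before = old_per_gb * lake_size_gb
cost_after = new_per_gb * lake_size_gb
print(f"Before: ${cost_before:,.0f}  After: ${cost_after:,.0f}")
```

At the lower price the same volume costs a tenth of what it did, which is what makes a large lake on cloud storage a realistic option.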

What changed?

Page 14: Data lake – On Premise VS Cloud

Conclusion

Large Data lake on the cloud is possible

Page 15: Data lake – On Premise VS Cloud

Deployment Options

Bare Metal
Cloud IaaS: Compute Engine
HadooPaaS: Data Proc
DB/WHaaS: Big Table

Page 16: Data lake – On Premise VS Cloud

Compute Engine

• Full control over the Hadoop distribution and ecosystem
• Will support any weird situation you need
• Not much less work than on-premises deployments
• Hard to make it pay per use

Moving parts counter – Very high

Page 17: Data lake – On Premise VS Cloud

Data Proc

• Hadoop as a Service
• Some DevOps and administration effort
• Limited choice of Hadoop deployments
• Easy to make it pay per use

Moving parts counter – Low

Page 18: Data lake – On Premise VS Cloud

Big Table

• 0 DevOps and administration
• No Hadoop ecosystem
• Structured data support only
• Pay per use by design

Moving parts counter – None*
* None that you care about

Page 19: Data lake – On Premise VS Cloud

How do you choose?

Tools: Hadoop ecosystem vs. SQL
Data: Structured vs. Unstructured
Options: Compute Engine, Data Proc, Big Table
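
The two axes above, together with the trade-offs listed on the previous slides, can be read as a rough decision rule. The sketch below is only one possible reading: the function name and its parameters are hypothetical, and sending unstructured data to Data Proc rather than Compute Engine is an assumption based on its lower moving parts count.

```python
def suggest_deployment(needs_hadoop_ecosystem: bool,
                       structured_only: bool,
                       needs_full_control: bool = False) -> str:
    """Suggest a deployment option from the trade-offs listed in the slides."""
    # Compute Engine: full control, supports any situation, most moving parts.
    if needs_full_control:
        return "Compute Engine"
    # Big Table was listed with no Hadoop ecosystem and structured data only,
    # so either requirement rules it out (assumption: prefer Data Proc then).
    if needs_hadoop_ecosystem or not structured_only:
        return "Data Proc"
    # SQL on structured data with no administration effort.
    return "Big Table"


print(suggest_deployment(needs_hadoop_ecosystem=False, structured_only=True))
print(suggest_deployment(needs_hadoop_ecosystem=True, structured_only=True))
print(suggest_deployment(needs_hadoop_ecosystem=True, structured_only=False,
                         needs_full_control=True))
```

A real choice would also weigh cost, team skills and performance, which the following slides touch on.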

Page 20: Data lake – On Premise VS Cloud

THE BIG Question
Performance

Page 21: Data lake – On Premise VS Cloud

What affects performance

Code 40%, CPU 30%, IO 20%, Network 10%
As always, it depends

Code: you can change it
CPU: usually slower per core
IO: usually better
Network: usually better

Page 22: Data lake – On Premise VS Cloud

Give me some numbers (by http://tech.marksblogg.com/)

A Billion Taxi Rides – ON …

                          BigQuery        DataProc + Presto
Load time                 25 Min          3 Hours*
Simple aggregation time   2 Sec           44 Sec
Compute cost              ~0.07$/Query    1.14$/Hour
Storage cost              12.6$/Month     5.2$/Month

*convert to ORC
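
The two compute prices use different units (per query and per hour), so comparing them needs a query rate. A rough sketch of the break-even point, assuming the Data Proc cluster is billed at a flat hourly rate regardless of load and that every query costs about the same as the benchmark's simple aggregation; storage cost and load time are ignored here:

```python
BIGQUERY_COST_PER_QUERY = 0.07  # $/query, approximate figure from the table
DATAPROC_COST_PER_HOUR = 1.14   # $/hour, from the table


def hourly_compute_cost(queries_per_hour: float) -> dict:
    """Compute cost per hour for each platform at a given query rate."""
    return {
        "BigQuery": queries_per_hour * BIGQUERY_COST_PER_QUERY,
        # Assumption: the cluster is billed whether or not it is busy.
        "DataProc + Presto": DATAPROC_COST_PER_HOUR,
    }


# Query rate at which both platforms cost the same per hour.
break_even = DATAPROC_COST_PER_HOUR / BIGQUERY_COST_PER_QUERY
print(f"Break-even: about {break_even:.1f} queries per hour")

for rate in (5, 50):
    print(rate, hourly_compute_cost(rate))
```

Under these assumptions, pay-per-query is cheaper at low query rates and the always-on cluster is cheaper at high ones.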

Page 23: Data lake – On Premise VS Cloud

Numbers by - http://tech.marksblogg.com/

Very important: cloud storage was about 1.5x slower than HDFS

https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison

Optimization

#Nodes, Node type, Storage, Compression, File types, File and floor configuration
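
One way to work through these knobs is to list every combination up front and benchmark each one. A minimal sketch: only HDFS, cloud storage and ORC come from the slides, and every other value below is a hypothetical placeholder to be replaced with the options you actually want to test.

```python
from itertools import product

# Candidate values per knob. Placeholders, except where noted.
options = {
    "nodes": [5, 10],
    "node_type": ["standard", "high-memory"],
    "storage": ["HDFS", "cloud storage"],  # the two compared in the slides
    "compression": ["none", "snappy"],
    "file_type": ["ORC", "Parquet"],       # ORC appears in the benchmark
}

# Cartesian product of all knob values: one dict per configuration to test.
configs = [dict(zip(options, values)) for values in product(*options.values())]

print(f"{len(configs)} configurations to benchmark")
print(configs[0])
```

The grid grows quickly with each extra knob, so in practice it helps to fix the options you are sure about and vary only the rest.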

Page 24: Data lake – On Premise VS Cloud

Summary

No magic solutions – Test your assumptions

Always understand your data and needs

Invest the time on modeling and optimization

Page 25: Data lake – On Premise VS Cloud

What are we doing?

Hadoop on metal
