Data lake – On Premise VS Cloud



Ido Friedman

Data Lake From Bare metal to the clouds


IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, Couchbase, MongoDB, Python, Hadoop, SQL, and more]
WorkPlace: Perion
WhenNotWorking: @Sea

Data lake
The idea of a data lake is to have a single store of all the data in the enterprise, ranging from raw data (an exact copy of source-system data) to transformed data used for various tasks including reporting, visualization, analytics, and machine learning.
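The raw-to-transformed range above is often laid out as storage "zones". A minimal sketch of one such convention, where the zone names and path templates are illustrative assumptions, not something from the talk:

```python
# Illustrative data lake zone layout; zone names and path templates
# are hypothetical, not part of the original talk.
ZONES = {
    "raw": "/lake/raw/{source}/{date}/",            # exact copy of source data
    "transformed": "/lake/transformed/{dataset}/",  # cleaned, modeled data
}

def lake_path(zone, **parts):
    """Build a storage path for a dataset in the given zone."""
    return ZONES[zone].format(**parts)

print(lake_path("raw", source="orders", date="2018-01-01"))
print(lake_path("transformed", dataset="sales"))
```

The same convention works whether the paths live on HDFS or on object storage; only the filesystem prefix changes.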



Raw data

What is it? Who needs it? How can we access it? What can I get from it? How long do you keep it?

Traditional tools of the trade


Who uses Hadoop?

What Changed?

Hadoop development began in 2006, based on a Google white paper from 2004


Data locality! What changed?



Data locality! What does it mean? What changed?
Google Cloud Storage Connector for Spark and Hadoop
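With the connector, Hadoop and Spark jobs address Cloud Storage through `gs://` paths instead of `hdfs://`, trading data locality for decoupled storage. A typical `core-site.xml` fragment enabling it might look like the following (property and class names are from my recollection of the connector's documentation, and `my-project-id` is a placeholder; verify against the current docs):

```xml
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-project-id</value>
</property>
```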

X3 = 24K$
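The ×3 presumably reflects HDFS's default replication factor of 3. Assuming that reading, and a hypothetical $8K for the disks holding one copy of the data (the slide only shows the result, so the base figure is my assumption), the arithmetic is:

```python
# HDFS stores 3 copies of every block by default (dfs.replication = 3),
# so raw disk cost is roughly 3x the cost of one copy of the data.
replication_factor = 3
one_copy_cost = 8_000  # $, hypothetical base figure (not from the slide)
total_disk_cost = one_copy_cost * replication_factor
print(total_disk_cost)  # 24000
```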

SQL won! What changed?

Here to stay

The consumers

What changed?

Cloud storage cost
Google BigQuery cuts historical data storage cost in half and accelerates many queries by 10x

From $0.15 per GB to $0.015 per GB: a 90% reduction. What changed?
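The 90% figure follows directly from the two prices; as a quick check:

```python
old_price = 0.15   # $/GB/month (from the slide)
new_price = 0.015  # $/GB/month (from the slide)
reduction = (old_price - new_price) / old_price
print(f"{reduction:.0%}")  # 90%
```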


Large Data lake on the cloud is possible

Deployment Options


Dataproc

Bigtable

Compute Engine
Full control over the Hadoop distribution and ecosystem
Will support any weird situation you need
Not much less work than on-premises deployments
Hard to make it pay per use

Moving parts counter: Very high

Dataproc

Hadoop as a service
Some DevOps and administration effort
Limited choice of Hadoop deployments
Easy to make it pay per use
Moving parts counter: Low
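A toy cost model shows why ephemeral, job-scoped clusters make pay per use easy; the $2.50/hour rate and the usage pattern below are hypothetical placeholders, not quoted prices:

```python
def monthly_cost(hourly_rate, hours):
    """Cost of running a cluster for the given number of hours in a month."""
    return hourly_rate * hours

rate = 2.50                              # $/hour, hypothetical cluster rate
always_on = monthly_cost(rate, 24 * 30)  # cluster left running all month
ephemeral = monthly_cost(rate, 2 * 30)   # spun up 2 h/day for jobs, then deleted
print(always_on, ephemeral)  # 1800.0 150.0
```

The gap between the two numbers is the saving that an on-premises cluster, which is paid for whether or not jobs run, cannot capture.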

Bigtable

Zero DevOps and administration
No Hadoop ecosystem
Structured data support only
Pay per use by design
Moving parts counter: None* (*none that you care about)

How do you choose? Tools


Dataproc

Bigtable

[Decision chart: Hadoop ecosystem vs. SQL on one axis, structured vs. unstructured data on the other]


THE BIG Question



What affects performance? As always, it depends.
Code: you can change it
CPU: usually slower per core
IO: usually better
Network: usually better

Give me some numbers: Billion Taxi Rides

                         BigQuery        Dataproc + Presto
Load time                25 min          3 hours*
Simple aggregation time  2 sec           44 sec
Compute cost             ~$0.07/query    $1.14/hour
Storage cost             $12.6/month     $5.2/month

* convert to ORC
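Taking the slide's compute prices at face value, one can estimate the break-even query rate at which per-query BigQuery spend matches the hourly Dataproc + Presto cluster cost (a simplification that assumes the cluster would otherwise run continuously):

```python
bq_per_query = 0.07       # $/query (from the slide)
dataproc_per_hour = 1.14  # $/hour (from the slide)

# Queries per hour at which BigQuery's per-query spend equals
# the hourly cluster cost:
breakeven = dataproc_per_hour / bq_per_query
print(round(breakeven, 1))  # 16.3
```

Below roughly 16 queries per hour, per-query pricing is cheaper; above it, the flat-rate cluster starts to win on compute (storage cost runs the other way, as the table shows).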

Numbers by -

Very important: cloud storage was about 1.5x slower than HDFS; results depend on node type, storage, compression, file types, and file and folder configuration.

Summary
No magic solutions. Test your assumptions.

Always understand your data and needs

Invest the time in modeling and optimization

What are we doing?

Hadoop on metal