Ido Friedman
Data Lake
From bare metal to the clouds
IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, Couchbase, MongoDB, Python, Hadoop, SQL, and more …]
WorkPlace: Perion
WhenNotWorking: @Sea
Data lake
The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of the source system data) to transformed data used for various tasks, including reporting, visualization, analytics, and machine learning.
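A minimal sketch of the raw-zone idea, assuming the lake sits on Google Cloud Storage: land an exact copy of a source extract before any transformation. The bucket name and object paths below are hypothetical placeholders.

```python
# Hypothetical example: land an exact copy of a source extract in the raw zone.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-data-lake")                  # hypothetical bucket
raw_copy = bucket.blob("raw/orders/2016-05-01/orders.csv")   # raw zone mirrors the source layout
raw_copy.upload_from_filename("orders.csv")                  # exact copy, no transformation
```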
Goals????
Raw
Raw data
What is it?
Who needs it?
How can we access it?
What can I get from it?
How long do you keep it?
Traditional tools of the trade
SQL
What changed?
Everything.
Hadoop development started around 2006, based on a Google white paper from 2004 (MapReduce); version 1.0 shipped in 2011.
Data locality!
What changed?
2013 / 2015
Data locality! What does it mean? What changed?
Google Cloud Storage Connector for Spark and Hadoop
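With the connector installed (it ships preinstalled on Dataproc), Spark and Hadoop jobs can read gs:// paths directly instead of hdfs:// paths. A minimal PySpark sketch; the bucket, path, and column name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-connector-demo").getOrCreate()

# With the GCS connector on the classpath, gs:// behaves like any other Hadoop filesystem.
trips = spark.read.csv("gs://example-data-lake/raw/trips/", header=True)  # hypothetical path
trips.groupBy("cab_type").count().show()                                  # hypothetical column
```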
×3 = $24K
SQL won!
What changed?
Here to stay
The consumers
What changed?
Cloud storage cost
[Chart: cloud storage price in $/GB; x-axis: dates from 3/14/2006 to 1/12/2016; y-axis: $0.00–$0.16]
Google BigQuery cuts historical data storage cost in half and accelerates many queries by 10x
From $0.15 per GB to $0.015 per GB
A 90% price drop
What changed?
Conclusion
A large data lake in the cloud is possible
Deployment Options
Bare metal → Cloud IaaS → Hadoop PaaS → DB/DWH as a Service
Compute Engine / Dataproc / Bigtable
Compute Engine
• Full control over the Hadoop distribution and ecosystem
• Will support any weird situation you need
• Not much less work than on-premises deployments
• Hard to make it pay per use
Moving parts counter – Very high
Dataproc
• Hadoop as a service
• Some DevOps and administration effort
• Limited choice of Hadoop deployments
• Easy to make it pay per use (see the ephemeral-cluster sketch below)
Moving parts counter – Low
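One common way to get pay-per-use out of Dataproc is an ephemeral cluster: create it, submit the job, delete it. A rough sketch driving the gcloud CLI from Python; the cluster name, region, and job script are hypothetical.

```python
import subprocess

def run(cmd):
    """Run a gcloud command and fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# Ephemeral cluster: it exists only for the lifetime of the job (names are placeholders).
run(["gcloud", "dataproc", "clusters", "create", "etl-cluster",
     "--region=us-central1", "--num-workers=2"])
try:
    run(["gcloud", "dataproc", "jobs", "submit", "pyspark", "transform.py",
         "--cluster=etl-cluster", "--region=us-central1"])
finally:
    # Tear the cluster down even if the job fails, so billing stops.
    run(["gcloud", "dataproc", "clusters", "delete", "etl-cluster",
         "--region=us-central1", "--quiet"])
```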
Bigtable
• Zero DevOps and administration
• No Hadoop ecosystem
• Structured data support only
• Pay per use by design (see the client sketch below)
Moving parts counter – None* (*none that you care about)
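For comparison, a minimal sketch of writing and reading a row with the Cloud Bigtable Python client. The project, instance, table, and column family names are hypothetical, and the table is assumed to already exist.

```python
from google.cloud import bigtable

client = bigtable.Client(project="example-project")           # hypothetical project
table = client.instance("example-instance").table("events")   # hypothetical instance/table

# Write one cell: Bigtable rows are keyed, and values live under column families.
row = table.direct_row(b"user#1234#2016-05-01")
row.set_cell("metrics", b"clicks", b"42")                      # family "metrics" assumed to exist
row.commit()

# Read it back.
print(table.read_row(b"user#1234#2016-05-01"))
```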
How do you choose?
Tools
[Comparison matrix: Compute Engine, Dataproc, Bigtable vs. Tools (Hadoop ecosystem, SQL) and Data (structured, unstructured)]
THE BIG question: Performance
What affects performance?
[Pie chart: Code 40%, CPU 30%, IO 20%, Network 10%]
As always, it depends
Code – you can change it
CPU – usually slower per core in the cloud
IO – usually better
Network – usually better
Give me some numbers (by http://tech.marksblogg.com/)
A Billion Taxi Rides – on …
                          BigQuery        Dataproc + Presto
Load time                 25 min          3 hours*
Simple aggregation time   2 sec           44 sec
Compute cost              ~$0.07/query    $1.14/hour
Storage cost              $12.60/month    $5.20/month
* includes conversion to ORC
Numbers by http://tech.marksblogg.com/
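For reference, the BigQuery side of a "simple aggregation" like the one above takes a few lines from Python; the dataset and table names here are hypothetical stand-ins for the taxi-rides data.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project credentials

# Hypothetical table reference; the benchmark used a billion-row taxi trips table.
query = """
    SELECT cab_type, COUNT(*) AS trips
    FROM `example-project.taxi.trips`
    GROUP BY cab_type
"""
for row in client.query(query).result():
    print(row.cab_type, row.trips)
```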
Very important: cloud storage was about 1.5× slower than HDFS
https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison
Optimization
#Nodes, node type, storage, compression, file types, file and folder configuration
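Several of these knobs show up directly when writing data: a columnar format, a compression codec, and a partitioned folder layout. A hedged PySpark sketch; the paths and partition column are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-optimization").getOrCreate()

trips = spark.read.csv("gs://example-data-lake/raw/trips/", header=True)  # hypothetical path

# Columnar format + compression + partitioned folders: fewer bytes scanned per query.
(trips.write
      .partitionBy("pickup_date")        # folder-per-partition layout (hypothetical column)
      .parquet("gs://example-data-lake/curated/trips/",
               mode="overwrite",
               compression="snappy"))
```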
Summary
No magic solutions – test your assumptions
Always understand your data and needs
Invest time in modeling and optimization
What are we doing?
Hadoop on metal