NoSQL Data Modeling
Data Lake: From Bare Metal to the Clouds
IdoFriedman.yml
Name: Ido Friedman
Past: [Data platform consultant, Instructor, Team leader]
Present: [Data engineer, Architect]
Technologies: [Elasticsearch, Couchbase, MongoDB, Python, Hadoop, SQL, and more]
WorkPlace: Perion
WhenNotWorking: @Sea
Data lake
The idea of a data lake is to have a single store for all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for various tasks including reporting, visualization, analytics, and machine learning.
What is it?
Who needs it?
How can we access it?
What can I get from it?
How long do you keep it?
Traditional tools of the trade
Who uses Hadoop?
Hadoop development started in 2006, based on a Google white paper from 2004
Data locality!
What changed?
Data locality! What does it mean? What changed?
Google Cloud Storage Connector for Spark and Hadoop
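The GCS connector is what lets a Hadoop or Spark cluster read `gs://` paths directly instead of copying data into HDFS, which is why storage and compute can be separated in the cloud. A minimal configuration sketch — property names are from the connector's documentation, and the key-file path is a placeholder:

```xml
<!-- core-site.xml: minimal wiring for the Google Cloud Storage connector.
     Assumes the connector JAR is already on the Hadoop classpath. -->
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <!-- placeholder path: point at your service-account key -->
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/keyfile.json</value>
  </property>
</configuration>
```

With this in place, jobs can address buckets as `gs://bucket/path` the same way they would address `hdfs://` paths.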
X3 = 24K$
SQL won!
What changed?
Here to stay
Cloud storage cost
Google BigQuery cuts historical data storage cost in half and accelerates many queries by 10x
From $0.15 per GB to $0.015 per GB — a 90% reduction
What changed?
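A quick sanity check on the headline number, using only the two prices from the slide; the 100 TB lake size is an illustrative assumption, not a figure from the deck:

```python
old_price = 0.15   # $/GB/month before the price cut (from the slide)
new_price = 0.015  # $/GB/month after the cut (from the slide)

# Percentage reduction: (0.15 - 0.015) / 0.15 = 0.9, i.e. 90%
reduction_pct = (old_price - new_price) / old_price * 100

# Illustrative monthly bill for a hypothetical 100 TB data lake
lake_gb = 100 * 1024
monthly_cost = lake_gb * new_price  # 102400 GB * $0.015 = $1536/month

print(reduction_pct, monthly_cost)
```

At the old price the same 100 TB would have cost ten times as much per month, which is what makes a large data lake on cloud storage economically plausible.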
Large Data lake on the cloud is possible
Compute Engine
- Full control over the Hadoop distribution and ecosystem
- Will support any weird situation you need
- Not much less work than on-premises deployments
- Hard to make it pay per use
Moving-parts counter: Very high
Hadoop as a Service
- Some DevOps and administration effort
- Limited choice of Hadoop deployments
- Easy to make it pay per use
Moving-parts counter: Low
- 0 DevOps and administration
- No Hadoop ecosystem
- Structured data support only
- Pay per use by design
Moving-parts counter: None*
* None that you care about
How do you choose?
Tools
THE BIG Question
What affects performance?
As always, it depends:
- Code: you can change it
- CPU: usually slower per core
- IO: usually better
- Network: usually better
Give me some numbers (by http://tech.marksblogg.com/)
A Billion Taxi Rides on:
                          BigQuery        DataProc + Presto
Load time                 25 min          3 hours*
Simple aggregation time   2 sec           44 sec
Compute cost              ~$0.07/query    $1.14/hour
* includes converting the data to ORC
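A back-of-the-envelope reading of the cost row, using only the two benchmark numbers: BigQuery bills per query while the DataProc + Presto cluster bills per hour, so the interesting figure is how many queries per hour make the always-on cluster the cheaper option.

```python
bigquery_cost_per_query = 0.07  # ~$ per query (benchmark figure)
dataproc_cost_per_hour = 1.14   # $ per hour for the Presto cluster (benchmark figure)

# Break-even query rate: above this, the fixed-price cluster wins;
# below it, pay-per-query BigQuery wins.
break_even_queries_per_hour = dataproc_cost_per_hour / bigquery_cost_per_query

print(round(break_even_queries_per_hour, 1))  # roughly 16 queries/hour
```

This is a sketch, not a full TCO comparison — it ignores load time, idle hours, and the 22x difference in query latency shown in the table.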
Very important: cloud storage was about 1.5x slower than HDFS
https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison
Optimization knobs: number of nodes, node type, storage, compression, file types, file and folder configuration
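Of the optimization knobs listed above, compression is the easiest to demonstrate offline. A toy sketch with synthetic data (not the taxi dataset) showing why repetitive raw records compress so well — which cuts both storage cost and the I/O scanned per query:

```python
import gzip

# Synthetic, highly repetitive rows, similar in shape to raw CSV logs.
rows = "2016-05-01,taxi,NYC,12.5\n" * 10_000
raw = rows.encode()

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)

print(len(raw), len(compressed), round(ratio, 1))
```

Columnar formats such as ORC or Parquet (the benchmark above converted to ORC) push this further by compressing each column separately and letting the engine skip columns it does not need.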
Summary
No magic solutions
Test your assumptions
Always understand your data and needs
Invest the time on modeling and optimization
What are we doing?
Hadoop on metal