Cloud Big Data Architectures
Lynn Langit
QCon Sao Paulo, Brazil 2016
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP 1. Big Data Solution Types 2. Data Pipelines 3. ETL and Visualization 4. Bonus…(if time allows)
Save ALL of your Data
What is the ACTUAL Cost of ✘ Saving all Data ✘ Using newer technologies ✘ Going beyond Relational
About this Workshop
Real-world Cloud Scenarios w/AWS, Azure and GCP 1. When to use which type of Big Data Solution 2. The new world of Data Pipelines 3. ETL and Visualization Practicalities 4. Bonus…(if time allows)
1.
Big Data – Yes! But what kind?
Pattern 1
✘ Which type(s) of Big Data work best? -- when to use Hadoop -- when to use NoSQL, and which type, e.g. key-value, document, graph, etc. -- when to use Big Relational, and what type of workload for hot, warm or cold data
Choice… is good, right?
When do I use…? ✘ Hadoop ✘ NoSQL ✘ Big Relational
Size Matters
One Vendor’s View
Where is Hadoop Used?
Hadoop is your LAST CHOICE
✘ Volume ✘ 10 TB or greater to start ✘ Growth of 25% YOY ✘ Where FROM ✘ Where TO
✘ Velocity and Variety ✘ Spark over Hive ✘ Kafka and Samza
✘ Veracity ✘ Pay, train and hire team ✘ Top $$$ for talent ✘ IF you can find it ✘ WATCH OUT for Cloud Vendors who promise ‘easy access’ ✘ Complexity of ecosystem ✘ Cloudera knows best
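To make the Volume thresholds above concrete, here is a minimal sketch that compounds a starting volume at a yearly growth rate. The function name `project_volume_tb` is hypothetical; the 10 TB start and 25% YOY figures are the ones quoted on the slide, and the five-year horizon is an assumption for illustration.

```python
def project_volume_tb(start_tb: float, yearly_growth: float, years: int) -> list[float]:
    """Return the projected volume in TB for year 0 through `years`, compounding yearly."""
    return [start_tb * (1 + yearly_growth) ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Slide figures: 10 TB to start, growing 25% year over year.
    for year, tb in enumerate(project_volume_tb(10.0, 0.25, 5)):
        print(f"year {year}: {tb:.1f} TB")
```

A projection like this is only a starting point for the "Where FROM / Where TO" questions; real growth is rarely this smooth.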
225 NoSQL Database Types to Choose From
Let’s review some NoSQL concepts
Key-Value: Redis, Riak, Aerospike
Graph: Neo4j
Document: MongoDB
Wide-Column: Cassandra, HBase
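As a rough illustration of how the four NoSQL models differ, the sketch below holds similar data in each shape using plain Python structures. It does not use any of the databases named above; every key, field name and the helper `followers_of` are hypothetical.

```python
import json

# Key-value: an opaque value fetched by exactly one key.
kv_store = {"user:42": json.dumps({"name": "Ana", "city": "Sao Paulo"})}

# Document: a nested, self-describing record that can be queried by field.
document = {"_id": 42, "name": "Ana", "orders": [{"sku": "A1", "qty": 2}]}

# Wide-column: row key -> column family -> sparse columns.
wide_column = {
    "user:42": {
        "profile": {"name": "Ana"},
        "activity": {"2016-03-01": "login", "2016-03-02": "purchase"},
    }
}

# Graph: nodes plus typed edges between them.
nodes = {42: {"name": "Ana"}, 7: {"name": "Bruno"}}
edges = [(42, "FOLLOWS", 7)]


def followers_of(node_id, edge_list):
    """Return the ids of nodes that have a FOLLOWS edge pointing at node_id."""
    return [src for src, rel, dst in edge_list if rel == "FOLLOWS" and dst == node_id]


if __name__ == "__main__":
    print(json.loads(kv_store["user:42"])["name"])  # key-value lookup
    print(document["orders"][0]["qty"])             # document field access
    print(sorted(wide_column["user:42"]["activity"]))  # column scan within a row
    print(followers_of(7, edges))                   # graph traversal
```

The point of the comparison is the access pattern: the shape you query most often suggests which model fits.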
Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?
✘ Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org
✘ Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com
NoSQL Example
Practice Applying Concepts - NoSQL
NoSQL Applied
Log Files • ???
Product Catalogs • ???
Social Games • ???
Social aggregators • ???
Line-of-Business • ???
NoSQL Applied
Log Files • Columnstore • HBase
Product Catalogs • Key/Value • Redis
Social Games • Document • MongoDB
Social aggregators • Graph • Neo4j
Line-of-Business • RDBMS • SQL Server
More than NoSQL
NoSQL ✘ Non-relational ✘ Can be optimized in-memory ✘ Eventually consistent ✘ Schema on Read ✘ Example: Aerospike
NewSQL ✘ Relational plus more ✘ Often in-memory ✘ Some kind of SQL-layer ✘ Schema on Write ✘ Example: MemSQL
U-SQL ✘ What??? ✘ Microsoft’s universal SQL language ✘ Example: Azure Data Lake
Focus
How Best to Store your Data?
Storage type | Complexity | Scalability | Developer Cost
RDBMS | easy | medium | low
NoSQL | medium | big | high
Hadoop | hard | huge | very high
Real World Big Data -- When do I use what?
RDBMS 65%
NoSQL 30%
Hadoop 5%
Do the Cloud Vendors Understand Big Data Realities?
Cloud Big Data Vendors - Storage
AWS ✘ 5-10X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Big Relational
GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: Query as a Service
Azure ✘ Catching up ✘ Best tooling integration ✘ Notable: On-premise integration
AWS Console 17 Data services
GCP Console 8 Data Services
Azure Console 15 Data Services
Cloud Offerings – Big Data
Category | AWS | Google | Microsoft
Managed RDBMS | RDS, Aurora | Cloud SQL | Azure SQL
Data Warehouse | Redshift | BigQuery | Azure SQL Data Warehouse
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline | Azure Blobs, StorSimple
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Bigtable, Cloud Datastore | Azure Tables
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE | DocumentDB, Neo4j on Azure
Hadoop | Elastic MapReduce | DataProc | Data Lake, HDInsight
Practice Applying Concepts – Real Cost of Storage Types
Cloud NoSQL Applied – AWS
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
Cloud NoSQL Applied – AWS
Log Files • Stream or
Hadoop • Kinesis or EMR
Product Catalogs • Key/Value • DynamoDB
Social Games • Document • MongoDB
Social aggregators • Graph • Neo4j
Line-of-Business • RDBMS • RDS
The fastest growing cloud-based Big Data products are… ???
The fastest growing cloud-based Big Data products are… Relational
Practice Applying Concepts – Real Cost of Storage Types
Reasons to use Big Relational Cloud Services
Developers ✘ Most know RDBMS query patterns ✘ Many know basic administration
DevOps ✘ Most know RDBMS administration ✘ Many know basic RDBMS queries ✘ Many know query optimization
Cloud Vendors - AWS ✘ Aurora – RDBMS up to 64 TB ✘ Redshift – $1k USD / 1 TB / year ✘ Rich partner ecosystem – ETL ✘ Integration with AWS products
Developers ✘ Most know coding language patterns to interact with RDBMS systems
DevOps ✘ Familiar RDBMS security patterns ✘ Familiar auditing ✘ Partner tooling integration
Cloud Vendors - GCP ✘ BigQuery – familiar SQL queries ✘ No-hassle streaming ingest ✘ No-hassle pay-as-you-go ✘ Zero administration
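The two AWS numbers above lend themselves to quick back-of-the-envelope checks. The sketch below is illustrative only: the constants are the figures quoted on the slide (not current pricing or limits), and the function names are hypothetical.

```python
# Figures as quoted on the slide; treat them as illustrative, not as current pricing or limits.
REDSHIFT_USD_PER_TB_YEAR = 1_000
AURORA_MAX_TB = 64


def yearly_warehouse_cost_usd(volume_tb: float, usd_per_tb_year: float = REDSHIFT_USD_PER_TB_YEAR) -> float:
    """Estimate the yearly warehouse cost for a given volume at a flat per-TB rate."""
    return volume_tb * usd_per_tb_year


def fits_in_aurora(volume_tb: float, limit_tb: float = AURORA_MAX_TB) -> bool:
    """Check whether a volume stays within the quoted single-database limit."""
    return volume_tb <= limit_tb


if __name__ == "__main__":
    for tb in (10, 64, 100):
        print(f"{tb} TB: ~${yearly_warehouse_cost_usd(tb):,.0f}/year, fits in Aurora: {fits_in_aurora(tb)}")
```

A flat per-TB rate ignores compute, egress and reserved-instance discounts, so a real estimate would need the vendor's pricing calculator.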
My top Big Data Cloud Services
ETL is 75% of all Big Data Projects
Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.
2.
Data Pipelines Build vs. Buy
Pattern 2
✘ How to build optimized cloud-based data pipelines? -- Cloud-based ETL tools and processes -- includes load-testing patterns and security practices -- including connecting between different vendor clouds
Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
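The Veracity question ("verification on ingest needed?") can be sketched as a small validate-and-dedupe step. This is a minimal sketch under assumed field names (`event_id`, `user_id`, `amount`); the `Event` type and `validate_and_dedupe` function are hypothetical and not tied to any vendor's ETL tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    user_id: str
    amount: float


def validate_and_dedupe(rows):
    """Split raw dict rows into clean Events and rejected rows.

    A row is rejected if a required field is missing, the amount is not
    numeric, or its event_id has already been seen in this batch.
    """
    seen_ids = set()
    clean, rejected = [], []
    for row in rows:
        try:
            event = Event(str(row["event_id"]), str(row["user_id"]), float(row["amount"]))
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
            continue
        if event.event_id in seen_ids:
            rejected.append(row)
            continue
        seen_ids.add(event.event_id)
        clean.append(event)
    return clean, rejected


if __name__ == "__main__":
    raw = [
        {"event_id": "1", "user_id": "a", "amount": "3.5"},
        {"event_id": "1", "user_id": "a", "amount": "3.5"},  # duplicate
        {"event_id": "2", "user_id": "b"},                   # missing amount
    ]
    clean, rejected = validate_and_dedupe(raw)
    print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to measure data quality over time.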
Together How does your data pipeline flow?
Considering… ✘ Initial Load/Transform ✘ Data Quality ✘ Batch vs. Stream
Pipeline Phases
Phase 0: Eval Current Data - Quality & Quantity
Phase 1: Get New Data - Free or Premium
Phase 2: Build MVP & Forecast volume and growth
Phase 3: Load test at scale
Phase 4: Deploy – secure, audit and monitor
Cloud Big Data Vendors - ETL
AWS ✘ 5X market share of next competitor ✘ Notable: Many, strong ETL Partners
GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Notable: DataFlow requires Java or Python developers
Azure ✘ Difficulty with scale ✘ Best tooling integration ✘ Notable: Nothing
How Best to Ingest and ETL your Data?
Storage type | Complexity | Scalability | Developer Cost
RDBMS | medium | medium | low
NoSQL | medium | big | high
Hadoop | hard | huge | very high
Building a Streaming Pipeline
Stream Interval Window
Near Real-time Streams
Load Test All The Things
Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
Cloud Big Data Vendors - Streaming
AWS ✘ 5X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Kinesis Firehose
GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Requires top developers ✘ Notable: DataFlow flexible
Azure ✘ Catching up ✘ Best tooling integration ✘ Notable: Stream Analytics integration with other products
Cloud Offerings – Data and Pipelines
Category | AWS | Google | Microsoft
Managed RDBMS | RDS, Aurora | Cloud SQL | Azure SQL
Data Warehouse | Redshift | BigQuery | Azure SQL Data Warehouse
NoSQL buckets | S3, Glacier | Cloud Storage, Nearline | Azure Blobs, StorSimple
NoSQL Key-Value, NoSQL Wide Column | DynamoDB | Bigtable, Cloud Datastore | Azure Tables
Streaming or ML | Kinesis, AWS Machine Learning | DataFlow, Google Machine Learning | StreamInsight, Azure ML
NoSQL Document, NoSQL Graph | MongoDB on EC2, Neo4j on EC2 | MongoDB on GCE, Neo4j on GCE | DocumentDB, Neo4j on Azure
Hadoop | Elastic MapReduce | DataProc | Data Lake, HDInsight
Cloud ETL | Data Pipelines | DataFlow | Azure Data Pipeline
How Best to Stream your Data?
Streaming type | Complexity | Scalability | Developer Cost
Batches | easy | medium | low
Windows | difficult | big | high
Real-time | very difficult | huge | high
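To show what the "Windows" row means in practice, here is a minimal sketch of a fixed-size (tumbling) window count over timestamped events. It runs over an in-memory list and ignores late or out-of-order data, which real streaming services have to handle; the function name and event shape are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns {window_start_seconds: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp maps to exactly one window: the one starting at
        # the largest multiple of window_seconds not exceeding it.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    clicks = [(0, "home"), (30, "home"), (59, "cart"), (60, "home"), (125, "cart")]
    for start, counts in tumbling_window_counts(clicks, 60).items():
        print(f"window starting at {start}s: {counts}")
```

A batch job would run the same aggregation once over a whole file; the window version trades that simplicity for fresher results.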
Practice Applying Concepts
Designing Cloud Data Pipelines
Log Files
Product Catalogs
Social Games
Social aggregators
Line-of-Business
3.
Making Sense of Data Analytics and Presentation
Pattern 3
✘ How best to Query and Visualize -- When to use business analytics vs. predictive analytics (machine learning) -- how best to present data to clients - partner visualization products or roll your own
Making Sense of Data
Machine Learning Reports Presentation
Key Questions - Query ✘ Volume ✘ Variety ✘ Velocity ✘ Veracity
Graphs What is nature of your questions?
Cloud Big Data Vendors - Query
AWS ✘ 5X market share of next competitor ✘ Most complete offering ✘ Most mature offering ✘ Notable: Big Relational
GCP ✘ Lean, mean and cheap ✘ Fastest player ✘ Notable: Flexible, powerful machine learning
Azure ✘ WATCH OUT – Cost! ✘ Notable: Developer Tooling
Query Languages
SQL Everyone knows it But how well do they know it?
NoSQL Vendor Language Too many to list How will you learn it?
Cypher Query language for graph databases The future?
ORM Good, bad or horrible? Again, how well do they know it?
HIVE Shown in too many vendor demos Really hard to make performant
Machine Learning Queries SciPy, NumPy or Python R Language Julia Language Many more…
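For the SQL entry in the list above, a typical business-analytics query is a grouped aggregate. The sketch below runs one against an in-memory SQLite database from Python's standard library; the `sales` table, its columns and the function name are hypothetical and stand in for whichever relational service is actually used.

```python
import sqlite3


def revenue_by_category(rows):
    """Load (category, amount) rows into a temporary table and total them per category."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT category, SUM(amount) FROM sales GROUP BY category ORDER BY category"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("books", 10.0), ("games", 5.0), ("books", 2.5)]
    for category, total in revenue_by_category(sample):
        print(category, total)
```

The same GROUP BY shape carries over to most SQL dialects, which is part of why relational skills transfer so easily between vendors.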
Practice Applying Concepts – Understanding D3
How Best to Query your Data?
Storage type | Business Analytics | Predictive Analytics | Developer Cost
RDBMS | easy | medium | low
NoSQL | hard | very hard | very high
Hadoop | hard | hard | very high
Machine Learning aka Predictive Analytics
AWS ML for developers GUI-based
GCP 3 Flavors of ML Python-based languages
Azure ML for Data Scientists R Language
Presentation
If you can’t see it, it’s not worth it.
Dashboards ✘ More than KPIs ✘ Mobile ✘ Alerts ✘ Data Stories
Innovation in Data Visualization
Reports ✘ Level of Detail ✘ Meaningful Taxonomies ✘ Fast enough ✘ Drill for Data
D3 The language of Data Visualization
Cloud Big Data Vendors - Visualization
AWS ✘ Most complete offering ✘ Notable: Partners & QuickSight
GCP ✘ BigQuery Partners ✘ Notable: New Dashboards
Azure ✘ Integrated ✘ Notable: PowerBI
4.
About IoT It’s happening now
Data Generation Device
IoT is Big Data Realized
$235,000,000,000 – The IoT Market
By the year 2017
20 Billion devices – And a lot of users
IoT all the Things
Cloud Big Data Vendors - IoT
AWS ✘ First to market ✘ Most complete offering ✘ Most mature offering ✘ Notable: AWS IoT Rules
GCP ✘ Still in Beta ✘ Fastest player ✘ Requires top developers ✘ Notable: Weave
Azure ✘ Catching up ✘ Best tooling integration ✘ Notable: Device Mgmt.
Save ALL of your Data
The Next Generation…
Obrigada! (Thank you!)
Any questions?
You can find me at @lynnlangit