Cloud Big Data Architectures Lynn Langit QCon Sao Paulo, Brazil 2016


Page 1: Cloud Big Data Architectures

Cloud Big Data Architectures

Lynn Langit

QCon Sao Paulo, Brazil 2016

Page 2: Cloud Big Data Architectures

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP
1. Big Data Solution Types
2. Data Pipelines
3. ETL and Visualization
4. Bonus… (if time allows)

Page 3: Cloud Big Data Architectures

Save ALL of your Data

Page 4: Cloud Big Data Architectures

What is the ACTUAL Cost of
✘ Saving all Data
✘ Using newer technologies
✘ Going beyond Relational

Page 5: Cloud Big Data Architectures

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus… (if time allows)

Page 6: Cloud Big Data Architectures

1.

Big Data – Yes! But what kind?

Page 7: Cloud Big Data Architectures

Pattern 1

✘ Which type(s) of Big Data work best?
-- when to use Hadoop
-- when to use NoSQL, and which type, e.g. key-value, document, graph, etc.
-- when to use Big Relational, and what type of workload for hot, warm or cold data

Page 8: Cloud Big Data Architectures

Choice… is good, right?

Page 9: Cloud Big Data Architectures

When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational

Page 10: Cloud Big Data Architectures

Size Matters

Page 11: Cloud Big Data Architectures
Page 12: Cloud Big Data Architectures

One Vendor’s View

Page 13: Cloud Big Data Architectures
Page 14: Cloud Big Data Architectures

Where is Hadoop Used?

Page 15: Cloud Big Data Architectures

Hadoop is your LAST CHOICE

✘ Volume
  ✘ 10 TB or greater to start
  ✘ Growth of 25% YOY
  ✘ Where FROM
  ✘ Where TO

✘ Velocity and Variety
  ✘ Spark over HIVE
  ✘ Kafka and Samza

✘ Veracity
  ✘ Pay, train and hire team
  ✘ Top $$$ for talent
  ✘ IF you can find it
  ✘ WATCH OUT for Cloud Vendors who promise ‘easy access’
  ✘ Complexity of ecosystem
  ✘ Cloudera knows best
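The volume figures on this slide (10 TB or more to start, roughly 25% year-over-year growth) can be turned into a quick projection. The following is a minimal sketch and not part of the original deck; the function name `project_volume_tb` and the three-year horizon are assumptions chosen for illustration.

```python
def project_volume_tb(start_tb, yoy_growth, years):
    """Return projected volume in TB for year 0..years, compounding annually."""
    return [start_tb * (1 + yoy_growth) ** year for year in range(years + 1)]


# Slide figures: 10 TB to start, 25% growth year over year.
# The 3-year horizon is an assumption for illustration only.
projection = project_volume_tb(start_tb=10, yoy_growth=0.25, years=3)
for year, volume in enumerate(projection):
    print(f"year {year}: {volume:.2f} TB")
```

Running the same function with your own starting volume and growth rate gives a first rough answer to the "how much now, what growth rate" question before any platform is chosen.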

Page 16: Cloud Big Data Architectures

When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational

Page 17: Cloud Big Data Architectures

225 NoSQL Database Types to Choose From

Page 18: Cloud Big Data Architectures

Let’s review some NoSQL concepts

Key-Value: Redis, Riak, Aerospike
Graph: Neo4j
Document: MongoDB
Wide-Column: Cassandra, HBase

Page 19: Cloud Big Data Architectures

Page 20: Cloud Big Data Architectures

Key Questions - Storage
✘ Volume – how much now, what growth rate?
✘ Variety – what type(s) of data? ‘rectangular’, ‘graph’, ‘k-v’, etc…
✘ Velocity – batches, streams, both, what ingest rate?
✘ Veracity – current state (quality) of data, amount of duplication of data stores, existence of authoritative (master) data management?

Page 21: Cloud Big Data Architectures

NoSQL Example

✘ Open Source is Free
§ Rapid iteration, innovation
§ Can start up for free (on premise)
§ Can ‘rent’ for cheap or free on the cloud
§ Can use with the command line for free
§ Some vendors offer free online training
§ Ex. www.neo4j.org

✘ Not Free
§ Constant releases
§ Can be deceptively hard to set up (time is money)
§ Don’t forget to turn it off if on the cloud!
§ GUI tools, support, training cost $$$
§ Ex. www.neo4j.com

Page 22: Cloud Big Data Architectures

Practice Applying Concepts - NoSQL

Page 23: Cloud Big Data Architectures

NoSQL Applied

Log Files • ???

Product Catalogs • ???

Social Games • ???

Social aggregators • ???

Line-of-Business • ???

Page 24: Cloud Big Data Architectures

NoSQL Applied

Log Files • Columnstore • HBase

Product Catalogs • Key/Value • Redis

Social Games • Document • MongoDB

Social aggregators • Graph • Neo4j

Line-of-Business • RDBMS • SQL Server
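The workload-to-store pairing on this slide can be written down as a simple lookup. This is only an illustrative sketch: the dictionary copies the slide's suggestions, while the names `NOSQL_APPLIED` and `suggest_store` are hypothetical and not from the deck.

```python
# Workload -> (store type, example product), copied from the slide.
NOSQL_APPLIED = {
    "Log Files": ("Columnstore", "HBase"),
    "Product Catalogs": ("Key/Value", "Redis"),
    "Social Games": ("Document", "MongoDB"),
    "Social aggregators": ("Graph", "Neo4j"),
    "Line-of-Business": ("RDBMS", "SQL Server"),
}


def suggest_store(workload):
    """Return the slide's (type, product) pair, or None for an unlisted workload."""
    return NOSQL_APPLIED.get(workload)


print(suggest_store("Log Files"))
print(suggest_store("Time Series"))  # not on the slide, so no suggestion
```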

Page 25: Cloud Big Data Architectures

More than NoSQL

NoSQL
✘ Non-relational
✘ Can be optimized in-memory
✘ Eventually consistent
✘ Schema on Read
✘ Example: Aerospike

NewSQL
✘ Relational plus more
✘ Often in-memory
✘ Some kind of SQL-layer
✘ Schema on Write
✘ Example: MemSQL

U-SQL
✘ What???
✘ Microsoft’s universal SQL language
✘ Example: Azure Data Lake
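The "Schema on Read" versus "Schema on Write" distinction on this slide can be illustrated in a few lines. This is a minimal sketch under stated assumptions: the two-field schema, the in-memory lists standing in for a database, and the function names are all hypothetical, and no real product behaves exactly like this.

```python
import json

# Hypothetical schema used by both approaches.
REQUIRED_FIELDS = {"id": int, "name": str}


def write_with_schema(table, record):
    """Schema on write: validate before storing, reject anything that does not fit."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema on read: store anything, apply the schema only when querying."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        if all(isinstance(doc.get(f), t) for f, t in REQUIRED_FIELDS.items()):
            rows.append({f: doc[f] for f in REQUIRED_FIELDS})
    return rows


# Schema on write: the store only ever contains validated rows.
table = []
write_with_schema(table, {"id": 1, "name": "widget"})

# Schema on read: raw documents are kept as-is; bad ones are skipped at read time.
raw_store = ['{"id": 2, "name": "gadget", "extra": true}', '{"name": "no id"}']
rows = read_with_schema(raw_store)
print(rows)
```

The trade-off shown here is where the validation cost lands: at ingest for schema on write, at every query for schema on read.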

Page 26: Cloud Big Data Architectures

Focus

Page 27: Cloud Big Data Architectures

How Best to Store your Data?

          Complexity   Scalability   Developer Cost
RDBMS     easy         medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high

Page 28: Cloud Big Data Architectures

Real World Big Data -- When do I use what?

RDBMS 65%

NoSQL 30%

Hadoop 5%

Page 29: Cloud Big Data Architectures

Do the Cloud Vendors Understand Big Data Realities?

Page 30: Cloud Big Data Architectures

Cloud Big Data Vendors - Storage

AWS
✘ 5-10X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational

GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: Query as a Service

Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: On-premise integration

Page 31: Cloud Big Data Architectures

AWS Console 17 Data services

Page 32: Cloud Big Data Architectures

GCP Console 8 Data Services

Page 33: Cloud Big Data Architectures

Azure Console 15 Data Services

Page 34: Cloud Big Data Architectures

Cloud Offerings – Big Data

                               AWS                     Google                       Microsoft
Managed RDBMS                  RDS, Aurora             Cloud SQL                    Azure SQL
Data Warehouse                 Redshift                BigQuery                     Azure SQL Data Warehouse
NoSQL buckets                  S3, Glacier             Cloud Storage, Nearline      Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column  DynamoDB                Big Table, Cloud Datastore   Azure Tables
NoSQL Document / Graph         MongoDB, Neo4j on EC2   MongoDB, Neo4j on GCE        DocumentDB, Neo4j on Azure
Hadoop                         Elastic MapReduce       DataProc                     Data Lake, HDInsight

Page 35: Cloud Big Data Architectures

Practice Applying Concepts – Real Cost of Storage Types

Page 36: Cloud Big Data Architectures

Cloud NoSQL Applied – AWS

Log Files

Product Catalogs

Social Games

Social aggregators

Line-of-Business

Page 37: Cloud Big Data Architectures

Cloud NoSQL Applied – AWS

Log Files • Stream or Hadoop • Kinesis or EMR

Product Catalogs • Key/Value • DynamoDB

Social Games • Document • MongoDB

Social aggregators • Graph • Neo4j

Line-of-Business • RDBMS • RDS

Page 38: Cloud Big Data Architectures

The fastest growing cloud-based Big Data products are… ???

Page 39: Cloud Big Data Architectures

The fastest growing cloud-based Big Data products are… Relational

Page 40: Cloud Big Data Architectures

When do I use…?
✘ Hadoop
✘ NoSQL
✘ Big Relational

Page 41: Cloud Big Data Architectures

Practice Applying Concepts – Real Cost of Storage Types

Page 42: Cloud Big Data Architectures

Reasons to use Big Relational Cloud Services

Developers DevOps Cloud Vendors – AWS

Developers DevOps Cloud Vendors – GCP

Page 43: Cloud Big Data Architectures

Reasons to use Big Relational Cloud Services

Developers
  Most know RDBMS query patterns
  Many know basic administration

DevOps
  Most know RDBMS administration
  Many know basic RDBMS queries
  Many know query optimization

Cloud Vendors - AWS
  Aurora – RDBMS up to 64 TB
  Redshift – $1k USD / 1 TB / year
  Rich partner ecosystem – ETL
  Integration with AWS products

Developers
  Most know coding language patterns to interact with RDBMS systems

DevOps
  Familiar RDBMS security patterns
  Familiar auditing
  Partner tooling integration

Cloud Vendors - GCP
  BigQuery – familiar SQL queries
  No hassle streaming ingest
  No hassle pay-as-you-go
  Zero administration
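The two AWS figures quoted on this slide (Aurora up to 64 TB, Redshift at about $1k USD per TB per year) lend themselves to back-of-the-envelope sizing. The sketch below is an illustration only: the constants are the slide's 2016 numbers, not current pricing or limits, and the function names are hypothetical.

```python
# Figures quoted on the slide (2016). Treat them as assumptions, not current pricing.
REDSHIFT_USD_PER_TB_YEAR = 1_000
AURORA_MAX_TB = 64


def redshift_annual_cost_usd(volume_tb):
    """Rough yearly Redshift cost at the slide's quoted rate."""
    return volume_tb * REDSHIFT_USD_PER_TB_YEAR


def fits_in_aurora(volume_tb):
    """True if the volume is within the Aurora size limit quoted on the slide."""
    return volume_tb <= AURORA_MAX_TB


for volume_tb in (10, 64, 100):
    print(volume_tb, "TB:",
          "Aurora OK" if fits_in_aurora(volume_tb) else "beyond Aurora",
          f"| Redshift ~${redshift_annual_cost_usd(volume_tb):,}/year")
```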

Page 44: Cloud Big Data Architectures

My top Big Data Cloud Services

Page 45: Cloud Big Data Architectures

ETL is 75% of all Big Data Projects

Surveying, cleaning and loading data is the majority of the billable time for new Big Data projects.

Page 46: Cloud Big Data Architectures

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus… (if time allows)

Page 47: Cloud Big Data Architectures

2.

Data Pipelines Build vs. Buy

Page 48: Cloud Big Data Architectures

Pattern 2

✘ How to build optimized cloud-based data pipelines?
-- Cloud-based ETL tools and processes
-- includes load-testing patterns and security practices
-- including connecting between different vendor clouds

Page 49: Cloud Big Data Architectures

Key Questions – Ingestion and ETL
✘ Volume – how much and how fast, now and future?
✘ Variety – what type(s) of data, any pre-processing needed?
✘ Velocity – batches or streaming?
✘ Veracity – verification on ingest needed? new data needed?
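The Veracity question ("verification on ingest needed?") can be made concrete with a small validation step that separates clean records from rejects before loading. This is a minimal sketch, not a method from the deck; the field names and the function `validate_on_ingest` are assumptions for illustration.

```python
def validate_on_ingest(records, required_fields):
    """Split incoming records into accepted and rejected (with the missing fields)."""
    accepted, rejected = [], []
    for record in records:
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        if missing:
            rejected.append((record, missing))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical event records; only the first one is complete.
incoming = [
    {"user_id": "u1", "event": "login"},
    {"user_id": "", "event": "click"},
    {"event": "logout"},
]
accepted, rejected = validate_on_ingest(incoming, required_fields=("user_id", "event"))
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Keeping the rejected records together with the reason makes it possible to measure data quality during the initial load instead of discovering it later.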

Page 50: Cloud Big Data Architectures

Together How does your data pipeline flow?

Page 51: Cloud Big Data Architectures

Considering…
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream

Page 52: Cloud Big Data Architectures

Pipeline Phases

Phase 0: Eval Current Data - Quality & Quantity
Phase 1: Get New Data - Free or Premium
Phase 2: Build MVP & Forecast volume and growth
Phase 3: Load test at scale
Phase 4: Deploy – secure, audit and monitor

Page 53: Cloud Big Data Architectures

Cloud Big Data Vendors - ETL

AWS
✘ 5X market share of next competitor
✘ Notable: Many, strong ETL Partners

GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: DataFlow requires Java or Python developers

Azure
✘ Difficulty with scale
✘ Best tooling integration
✘ Notable: Nothing

Page 54: Cloud Big Data Architectures

How Best to Ingest and ETL your Data?

          Complexity   Scalability   Developer Cost
RDBMS     medium       medium        low
NoSQL     medium       big           high
Hadoop    hard         huge          very high

Page 55: Cloud Big Data Architectures

Considering…
✘ Initial Load/Transform
✘ Data Quality
✘ Batch vs. Stream

Page 56: Cloud Big Data Architectures

Building a Streaming Pipeline

Stream Interval Window
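One common reading of "window" in a streaming pipeline is a fixed, non-overlapping (tumbling) window over event timestamps. The sketch below assumes that interpretation, which the slide itself does not spell out; the function name and the sample events are hypothetical, and a real pipeline would use a streaming service rather than an in-memory list.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window.

    events: iterable of (timestamp_seconds, payload) tuples.
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Hypothetical stream: five events, grouped into 10-second windows.
events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
counts = tumbling_window_counts(events, window_seconds=10)
print(counts)
```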

Page 57: Cloud Big Data Architectures
Page 58: Cloud Big Data Architectures

Near Real-time Streams

Load Test All The Things

Page 59: Cloud Big Data Architectures

Key Questions - Streaming
✘ Volume – how much data now and predicted over next 12 months?
✘ Variety – what types of data now and future?
✘ Velocity – volume of input data / time now and near future?
✘ Veracity – volume of EXISTING data now
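The Velocity question (volume of input data per unit of time) reduces to simple arithmetic once a daily volume is known. The following sketch is illustrative only; the 864 GB/day figure, the decimal conversion (1 GB = 1000 MB), and the `peak_factor` parameter are assumptions, not numbers from the deck.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput_mb_per_s(daily_volume_gb, peak_factor=1.0):
    """Average ingest rate in MB/s (1 GB = 1000 MB), scaled by a peak factor."""
    return daily_volume_gb * 1000 / SECONDS_PER_DAY * peak_factor


# Hypothetical workload: 864 GB per day, with peaks at 3x the average rate.
average = required_throughput_mb_per_s(864)
peak = required_throughput_mb_per_s(864, peak_factor=3.0)
print(f"average {average:.1f} MB/s, peak {peak:.1f} MB/s")
```

The peak figure, rather than the average, is the one to load test against.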

Page 60: Cloud Big Data Architectures

Cloud Big Data Vendors - Streaming

AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Kinesis Firehose

GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Requires top developers
✘ Notable: DataFlow flexible

Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Stream Analytics integration with other products

Page 61: Cloud Big Data Architectures

AWS Console 17 Data services

Page 62: Cloud Big Data Architectures

GCP Console 8 Data Services

Page 63: Cloud Big Data Architectures

Azure Console 15 Data Services

Page 64: Cloud Big Data Architectures

Cloud Offerings – Data and Pipelines

                               AWS                             Google                              Microsoft
Managed RDBMS                  RDS, Aurora                     Cloud SQL                           Azure SQL
Data Warehouse                 Redshift                        BigQuery                            Azure SQL Data Warehouse
NoSQL buckets                  S3, Glacier                     Cloud Storage, Nearline             Azure Blobs, StorSimple
NoSQL Key-Value / Wide Column  DynamoDB                        Big Table, Cloud Datastore          Azure Tables
Streaming or ML                Kinesis, AWS Machine Learning   DataFlow, Google Machine Learning   StreamInsight, Azure ML
NoSQL Document / Graph         MongoDB, Neo4j on EC2           MongoDB, Neo4j on GCE               DocumentDB, Neo4j on Azure
Hadoop                         Elastic MapReduce               DataProc                            Data Lake, HDInsight
Cloud ETL                      Data Pipelines                  DataFlow                            Azure Data Pipeline

Page 65: Cloud Big Data Architectures

How Best to Stream your Data?

            Complexity       Scalability   Developer Cost
Batches     easy             medium        low
Windows     difficult        big           high
Real-time   very difficult   huge          high

Page 66: Cloud Big Data Architectures

Practice Applying Concepts

Page 67: Cloud Big Data Architectures

Designing Cloud Data Pipelines

Log Files

Product Catalogs

Social Games

Social aggregators

Line-of-Business

Page 68: Cloud Big Data Architectures

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus… (if time allows)

Page 69: Cloud Big Data Architectures

3.

Making Sense of Data Analytics and Presentation

Page 70: Cloud Big Data Architectures

Pattern 3

✘ How best to Query and Visualize
-- When to use business analytics vs. predictive analytics (machine learning)
-- how best to present data to clients - partner visualization products or roll your own

Page 71: Cloud Big Data Architectures

Making Sense of Data

Machine Learning Reports Presentation

Page 72: Cloud Big Data Architectures

Key Questions - Query ✘ Volume ✘ Variety ✘ Velocity ✘ Veracity

Page 73: Cloud Big Data Architectures

Graphs
What is the nature of your questions?

Page 74: Cloud Big Data Architectures
Page 75: Cloud Big Data Architectures

Cloud Big Data Vendors - Query

AWS
✘ 5X market share of next competitor
✘ Most complete offering
✘ Most mature offering
✘ Notable: Big Relational

GCP
✘ Lean, mean and cheap
✘ Fastest player
✘ Notable: Flexible, powerful machine learning

Azure
✘ WATCH OUT – Cost!
✘ Notable: Developer Tooling

Page 76: Cloud Big Data Architectures

Query Languages

SQL
  Everyone knows it
  But how well do they know it?

NoSQL Vendor Language
  Too many to list
  How will you learn it?

Cypher
  Query language for graph databases
  The future?

ORM
  Good, bad or horrible?
  Again, how well do they know it?

HIVE
  Shown in too many vendor demos
  Really hard to make performant

Machine Learning Queries
  SciPy, NumPy or Python
  R Language
  Julia Language
  Many more…
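As a reminder of what "knowing SQL" looks like for basic business analytics, here is a small aggregate query run against an in-memory SQLite database from Python's standard library. The `orders` table and its rows are invented for illustration and do not come from the deck; SQLite simply stands in for whichever RDBMS is in use.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 5.0), ("bruno", 7.5)],
)

# Total spend per customer: a typical business-analytics aggregate.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```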

Page 77: Cloud Big Data Architectures

Practice Applying Concepts – Understanding D3

Page 78: Cloud Big Data Architectures

How Best to Query your Data?

Business Analytics

Predictive Analytics

Developer Cost

RDBMS

NoSQL

Hadoop

Page 79: Cloud Big Data Architectures

How Best to Query your Data?

          Business Analytics   Predictive Analytics   Developer Cost
RDBMS     easy                 medium                 low
NoSQL     hard                 very hard              very high
Hadoop    hard                 hard                   very high

Page 80: Cloud Big Data Architectures

Machine Learning aka Predictive Analytics

AWS
  ML for developers
  GUI-based

GCP
  3 Flavors of ML
  Python-based languages

Azure
  ML for Data Scientists
  R Language

Page 81: Cloud Big Data Architectures

Presentation

If you can’t see it, it’s not worth it.

Page 82: Cloud Big Data Architectures

Innovation in Data Visualization

Dashboards
✘ More than KPIs
✘ Mobile
✘ Alerts
✘ Data Stories

Reports
✘ Level of Detail
✘ Meaningful Taxonomies
✘ Fast enough
✘ Drill for Data

Page 83: Cloud Big Data Architectures

D3 The language of Data Visualization

Page 84: Cloud Big Data Architectures
Page 85: Cloud Big Data Architectures

Cloud Big Data Vendors - Visualization

AWS
✘ Most complete offering
✘ Notable: Partners & QuickSight

GCP
✘ BigQuery Partners
✘ Notable: New Dashboards

Azure
✘ Integrated
✘ Notable: PowerBI

Page 86: Cloud Big Data Architectures

About this Workshop

Real-world Cloud Scenarios w/AWS, Azure and GCP
1. When to use which type of Big Data Solution
2. The new world of Data Pipelines
3. ETL and Visualization Practicalities
4. Bonus… (if time allows)

Page 87: Cloud Big Data Architectures

4.

About IoT It’s happening now

Page 88: Cloud Big Data Architectures

Data Generation Device

Page 89: Cloud Big Data Architectures

IoT is

Big Data Realized

Page 90: Cloud Big Data Architectures

$235,000,000,000
The IoT Market

2017
By the year

20 Billion devices
And a lot of users

Page 91: Cloud Big Data Architectures

IoT all the Things

Page 92: Cloud Big Data Architectures

Cloud Big Data Vendors - IoT

AWS
✘ First to market
✘ Most complete offering
✘ Most mature offering
✘ Notable: AWS IoT Rules

GCP
✘ Still in Beta
✘ Fastest player
✘ Requires top developers
✘ Notable: Weave

Azure
✘ Catching up
✘ Best tooling integration
✘ Notable: Device Mgmt.

Page 93: Cloud Big Data Architectures

Save ALL of your Data

Page 94: Cloud Big Data Architectures

The Next Generation…

Page 95: Cloud Big Data Architectures

Obrigada! (Thank you!)

Any questions?

You can find me at @lynnlangit