
AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence


Page 1: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Matt Yanchyshyn, Sr. Manager Solutions Architecture

June 17th, 2015

AWS Deep Dive: Big Data Analytics and Business Intelligence

Page 2: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Analytics and BI on AWS

Amazon S3

Amazon Kinesis

Amazon DynamoDB

Amazon RDS (Aurora)

AWS Lambda

KCL Apps

Amazon EMR

Amazon Redshift

Amazon Machine Learning

Collect | Process | Analyze | Store

Data Collection and Storage

Data Processing

Event Processing

Data Analysis

Page 3: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Batch processing

GBs of logs pushed to Amazon S3 hourly

Daily Amazon EMR cluster using Hive to process data

Input and output stored in Amazon S3

Load subset into Amazon Redshift
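As a rough sketch of this batch pattern using boto (the same library the Amazon ML examples later in this deck use), the following launches a transient EMR cluster that runs a Hive script over the hourly logs in S3. The bucket names, script path, AMI version, and instance types are placeholders, not part of the webinar.

import boto.emr
from boto.emr.step import InstallHiveStep, HiveStep

emr = boto.emr.connect_to_region('us-east-1')

# Hypothetical script and bucket locations -- replace with your own
hive_step = HiveStep(
    name='Daily log aggregation',
    hive_file='s3://examplebucket/scripts/aggregate_logs.q',
    hive_args=['-d', 'INPUT=s3://examplebucket/logs/',
               '-d', 'OUTPUT=s3://examplebucket/output/'])

cluster_id = emr.run_jobflow(
    name='daily-hive-batch',
    log_uri='s3://examplebucket/emr-logs/',
    ami_version='3.8.0',
    master_instance_type='m3.xlarge',
    slave_instance_type='m3.xlarge',
    num_instances=3,
    steps=[InstallHiveStep(), hive_step])

The output written to S3 by the Hive step can then be loaded into Amazon Redshift with a COPY command.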

Page 4: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Reporting

Amazon S3 Log Bucket

Amazon EMR (structured log data)

Amazon Redshift

Operational Reports

Page 5: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Streaming data processing

TBs of logs sent daily

Logs stored in Amazon Kinesis

Amazon Kinesis Client Library

AWS Lambda

Amazon EMR

Amazon EC2
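A minimal sketch of the producer side of this streaming pattern with boto; the stream name and record layout are made up for illustration, and the consumers listed above (KCL apps, AWS Lambda, Amazon EMR, applications on Amazon EC2) would read from the same stream.

import json
import boto.kinesis

kinesis = boto.kinesis.connect_to_region('us-east-1')

# Hypothetical stream name -- create it first with create_stream()
record = {'host': 'web-01', 'status': 200, 'bytes': 1234}
kinesis.put_record(
    stream_name='log-stream',
    data=json.dumps(record),
    partition_key=record['host'])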

Page 6: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

TBs of logs sent daily

Logs stored in Amazon S3

Amazon EMR clusters

Hive Metastore on Amazon EMR

Interactive query

Page 7: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Structured data in Amazon Redshift

Load predictions into Amazon Redshift

-or-

Read prediction results directly from S3

Predictions in S3

Query for predictions with Amazon ML batch API

Your application

Batch predictions

Page 8: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Your application

Amazon DynamoDB + AWS Lambda

Trigger event with Lambda

Query for predictions with Amazon ML real-time API

Real-time predictions

Page 9: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Machine Learning

Page 10: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Machine Learning

Easy to use, managed machine learning service built for developers

Create models using data stored in AWS

Deploy models to production in seconds

Page 11: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Powerful machine learning technology

Based on Amazon’s battle-hardened internal systems

Not just the algorithms:
• Smart data transformations
• Input data and model quality alerts
• Built-in industry best practices

Grows with your needs:
• Train on up to 100 GB of data
• Generate billions of predictions
• Obtain predictions in batches or real time

Page 12: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Pay-as-you-go and inexpensive

Data analysis, model training, and evaluation: $0.42/instance hour

Batch predictions: $0.10 per 1,000 predictions

Real-time predictions: $0.10 per 1,000 predictions, plus an hourly capacity reservation charge

Page 13: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Building smart applications with Amazon ML

1. Build & train model
2. Evaluate and optimize
3. Retrieve predictions

Page 14: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Create a Datasource object

Page 15: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Create a Datasource object

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> ds = ml.create_data_source_from_s3(
...     data_source_id='my_datasource',
...     data_spec={
...         'DataLocationS3': 's3://bucket/input/',
...         'DataSchemaLocationS3': 's3://bucket/input/.schema'},
...     compute_statistics=True)
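Datasource creation is asynchronous, so a common follow-up is to poll until the statistics have been computed before training a model. A minimal sketch, assuming the same 'my_datasource' ID as above:

import time
import boto

ml = boto.connect_machinelearning()

# Poll until the datasource has finished computing statistics
while True:
    ds = ml.get_data_source(data_source_id='my_datasource')
    if ds['Status'] in ('COMPLETED', 'FAILED'):
        break
    time.sleep(30)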

Page 16: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Explore and understand your data

Page 17: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Train your model

Page 18: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Train your model

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')

Page 19: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Building smart applications with Amazon ML

1. Build & train model
2. Evaluate and optimize
3. Retrieve predictions

Page 20: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Explore model quality
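Model quality is surfaced through an Evaluation run against held-out data. A hedged sketch of the corresponding boto calls, where 'my_eval_datasource' is a hypothetical held-out datasource not shown in the deck:

import boto

ml = boto.connect_machinelearning()

# Evaluate the model against a held-out datasource
ml.create_evaluation(
    evaluation_id='my_evaluation',
    ml_model_id='my_model',
    evaluation_data_source_id='my_eval_datasource')

# Once the evaluation completes, inspect its quality metrics
# (for a regression model this includes the RMSE)
evaluation = ml.get_evaluation(evaluation_id='my_evaluation')
print(evaluation.get('PerformanceMetrics'))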

Page 21: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Fine-tune model interpretation

Page 22: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Building smart applications with Amazon ML

1. Build & train model
2. Evaluate and optimize
3. Retrieve predictions

Page 23: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Batch predictions

Asynchronous, large-volume prediction generation

Request through service console or API

Best for applications that deal with batches of data records

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> bp = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')

Page 24: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Real-time predictions

Synchronous, low-latency, high-throughput prediction generation

Request through the service API or the server-side and mobile SDKs

Best for interactive applications that deal with individual data records

>>> import boto

>>> ml = boto.connect_machinelearning()

>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})

{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}
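The predict() call above assumes a real-time endpoint already exists for the model. A minimal sketch of creating one with boto; the response field names follow the CreateRealtimeEndpoint API and may vary slightly:

import boto

ml = boto.connect_machinelearning()

# Create the real-time endpoint once per model; the returned
# endpoint URL is then passed to predict() as predict_endpoint
endpoint_info = ml.create_realtime_endpoint(ml_model_id='my_model')
print(endpoint_info.get('RealtimeEndpointInfo', {}).get('EndpointUrl'))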

Page 25: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Elastic MapReduce (EMR)

Page 26: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Why Amazon EMR?

Easy to use: launch a cluster in minutes

Low cost: pay an hourly rate

Elastic: easily add or remove capacity

Reliable: spend less time monitoring

Secure: manage firewalls

Flexible: control the cluster

Page 27: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

The Hadoop ecosystem can run in Amazon EMR

Page 28: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Try different configurations to find your optimal architecture

Choose your instance types

CPU: c3 family, cc1.4xlarge, cc2.8xlarge

Memory: m2 family, r3 family

Disk/IO: d2 family, i2 family

General: m1 family, m3 family

Workloads: batch processing | machine learning | Spark and interactive | large HDFS

Page 29: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Easily add or remove compute capacity in your cluster

Match compute demands with cluster sizing

Resizable clusters
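Resizing can also be scripted. A rough sketch with boto, where 'ig-EXAMPLE' is a placeholder instance group ID (real IDs can be looked up with list_instance_groups on your cluster):

import boto.emr

emr = boto.emr.connect_to_region('us-east-1')

# Grow a task or core instance group to ten instances
emr.modify_instance_groups(
    instance_group_ids=['ig-EXAMPLE'],
    new_sizes=[10])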

Page 30: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Easy to use Spot Instances

Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing

On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity

Meet SLA at predictable cost; exceed SLA at lower cost
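A hedged sketch of adding a Spot task instance group to a running cluster with boto; the cluster ID, instance count, and bid price are placeholders:

import boto.emr
from boto.emr.instance_group import InstanceGroup

emr = boto.emr.connect_to_region('us-east-1')

# Spot task nodes add throughput without holding HDFS data,
# so losing them to a Spot price spike does not lose data
task_group = InstanceGroup(
    num_instances=4,
    role='TASK',
    type='m3.xlarge',
    market='SPOT',
    name='spot-task-nodes',
    bidprice='0.10')

emr.add_instance_groups('j-EXAMPLE', [task_group])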

Page 31: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon S3 as your persistent data store

Separate compute and storage

Resize and shut down Amazon EMR clusters with no data loss

Point multiple Amazon EMR clusters at the same data in Amazon S3


Page 32: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

EMRFS makes it easier to use Amazon S3

Read-after-write consistency

Very fast list operations

Error handling options

Support for Amazon S3 encryption

Transparent to applications: s3://

Page 33: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

EMRFS client-side encryption

EMRFS enabled for Amazon S3 client-side encryption

Amazon S3 encryption clients

Key vendor (AWS KMS or your custom key vendor)

Amazon S3 (client-side encrypted objects)

Page 34: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

HDFS is still there if you need it

Iterative workloads

• If you’re processing the same dataset more than once

Disk I/O intensive workloads

Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
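A rough sketch of that copy step with boto; the S3DistCp jar location shown matches the 2015-era EMR documentation and may differ on newer releases, and the bucket and cluster ID are placeholders:

import boto.emr
from boto.emr.step import JarStep

emr = boto.emr.connect_to_region('us-east-1')

# Copy input data from Amazon S3 into HDFS before an I/O-heavy job
s3distcp = JarStep(
    name='Copy logs from S3 to HDFS',
    jar='s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar',
    step_args=['--src', 's3://examplebucket/logs/',
               '--dest', 'hdfs:///data/logs/'])

emr.add_jobflow_steps('j-EXAMPLE', [s3distcp])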

Page 35: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift

Page 36: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift Architecture

Leader Node
• SQL endpoint
• Stores metadata
• Coordinates query execution

Compute Nodes
• Execute queries in parallel
• Node types to match your workload: Dense Storage (DS2) or Dense Compute (DC1)
• Divided into multiple slices
• Local, columnar storage
• Example: 128 GB RAM, 16 TB disk, 16 cores per node

10 GigE (HPC) interconnect between nodes

Ingestion / backup / restore via S3, EMR, DynamoDB, SSH

SQL clients / BI tools connect over JDBC/ODBC

Customer VPC / Internal VPC

Page 37: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift

Column storage

Data compression

Zone maps

Direct-attached storage

With column storage, you only read the data you need

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

Page 38: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw

Amazon Redshift

Column storage

Data compression

Zone maps

Direct-attached storage

• COPY compresses automatically

• You can analyze and override

• More performance, less cost

Page 39: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift

Column storage

Data compression

Zone maps

Direct-attached storage

• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data

Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324   (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623     (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959     (min 637, max 959)

Page 40: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift

Column storage

Data compression

Zone maps

Direct-attached storage

• Local storage for performance

• High scan rates

• Automatic replication

• Continuous backup and streaming restores to/from Amazon S3

• User snapshots on demand

• Cross region backups for disaster recovery

Page 41: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift online resize

Continue querying during resize

New cluster deployed in the background at no extra cost

Data copied in parallel from node to node

Automatic SQL endpoint switchover via DNS

Page 42: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift works with existing data models

Star and snowflake schemas

Page 43: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift data distribution

Key distribution: same key to same location

All distribution: all data on every node

Even distribution: round-robin across slices

Page 44: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Sorting data in Amazon Redshift

In the slices (on disk), the data is sorted by a sort key

Choose a sort key that is frequently used in your queries

Data in columns is marked with a min/max value so Redshift can skip blocks not relevant to the query

A good sort key maximizes this block skipping, so entire blocks never need to be read
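For illustration, a table distributed on its join key and sorted on its most frequently filtered column might be created as follows; psycopg2 and the table definition are assumptions for this sketch, not part of the webinar (any PostgreSQL-compatible driver works against Redshift):

import psycopg2

# Hypothetical cluster endpoint and credentials
conn = psycopg2.connect(
    host='examplecluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='mydb', user='master', password='example-password')
cur = conn.cursor()

# DISTKEY co-locates rows that join on listid on the same slice;
# SORTKEY keeps rows ordered by saletime so zone maps can skip blocks
cur.execute("""
    CREATE TABLE sales (
        saleid    INTEGER,
        listid    INTEGER,
        saletime  TIMESTAMP,
        pricepaid DECIMAL(8,2))
    DISTKEY (listid)
    SORTKEY (saletime);
""")
conn.commit()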

Page 45: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

User Defined Functions

Python 2.7

PostgreSQL UDF syntax

System and network calls within UDFs are prohibited

Pandas, NumPy, and SciPy pre-installed

Import your own libraries
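A hedged sketch of defining a scalar Python UDF from a client application; psycopg2 and the f_hostname example are assumptions, but the CREATE FUNCTION ... LANGUAGE plpythonu form is the documented Redshift Python UDF syntax:

import psycopg2

conn = psycopg2.connect(
    host='examplecluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='mydb', user='master', password='example-password')
cur = conn.cursor()

# Scalar UDF: the body runs as Python 2.7 on the cluster and can use
# the pre-installed Pandas/NumPy/SciPy modules (no network calls allowed)
cur.execute("""
    CREATE OR REPLACE FUNCTION f_hostname(url VARCHAR)
    RETURNS VARCHAR IMMUTABLE AS $$
        from urlparse import urlparse
        return urlparse(url).hostname
    $$ LANGUAGE plpythonu;
""")
conn.commit()

# Usage in SQL: SELECT f_hostname(referrer_url) FROM weblogs;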

Page 46: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Interleaved Multi Column Sort

Currently supports compound sort keys
• Optimized for applications that filter data by one leading column

Adding support for interleaved sort keys
• Optimized for filtering data by up to eight columns
• No storage overhead, unlike an index
• Lower maintenance penalty compared to indexes

Page 47: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Amazon Redshift works with your existing analysis tools

JDBC/ODBC

Amazon Redshift

Page 48: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

Questions?

Page 49: AWS June Webinar Series - Deep Dive: Big Data Analytics and Business Intelligence

AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services.

Details
• July 1, 2015
• Chicago, Illinois
• McCormick Place

Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking

Registration is now open
• Come and see what AWS and the cloud can do for you.
• Click here to register: http://amzn.to/1RooPPL