© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Matt Yanchyshyn, Sr. Manager Solutions Architecture
June 17th, 2015
AWS Deep Dive: Big Data Analytics and Business Intelligence
Analytics and BI on AWS
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon RDS (Aurora)
AWS Lambda
KCL Apps
Amazon EMR
Amazon Redshift
Amazon Machine Learning
Collect → Store → Process → Analyze
Data Collection and Storage
Data Processing
Event Processing
Data Analysis
Batch processing
GBs of logs pushed to Amazon S3 hourly
Daily Amazon EMR cluster using Hive to process data
Input and output stored in Amazon S3
Load subset into Amazon Redshift
Reporting
Amazon S3 Log Bucket → Amazon EMR → Structured log data → Amazon Redshift → Operational Reports
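As a rough sketch of the last step in this pipeline, the snippet below builds the Redshift COPY statement that would load the Hive job's output subset from Amazon S3. The bucket, prefix, table name, and credentials string are all hypothetical placeholders, not values from the deck.

```python
# Sketch: build the COPY statement for loading EMR output from S3
# into Redshift. Bucket, table, and credentials are placeholders.
def build_copy_statement(table, s3_prefix, credentials):
    return (
        "COPY {table} FROM '{prefix}' "
        "CREDENTIALS '{creds}' "
        "DELIMITER '\\t' GZIP".format(
            table=table, prefix=s3_prefix, creds=credentials)
    )

sql = build_copy_statement(
    'daily_logs',
    's3://example-bucket/emr-output/2015-06-17/',
    'aws_access_key_id=...;aws_secret_access_key=...')
print(sql)
```

In practice you would run this statement from any SQL client over the cluster's JDBC/ODBC endpoint.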
Streaming data processing
TBs of logs sent daily
Logs stored in Amazon Kinesis
Amazon Kinesis Client Library
AWS Lambda
Amazon EMR
Amazon EC2
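To make the streaming path concrete, here is a simplified, offline model of how Amazon Kinesis routes records to shards: the MD5 hash of each record's partition key selects a shard from evenly split hash-key ranges. The shard count and key names are illustrative, not from the deck.

```python
import hashlib

# Simplified model of Kinesis shard routing: MD5(partition key)
# mapped onto evenly divided ranges of the 128-bit hash space.
def shard_for(partition_key, num_shards):
    h = int(hashlib.md5(partition_key.encode('utf-8')).hexdigest(), 16)
    return h * num_shards // (2 ** 128)

# Records with the same partition key always land on the same shard.
shards = [shard_for('host-%d' % i, 4) for i in range(8)]
print(shards)
```

This is why choosing a high-cardinality partition key matters: it spreads throughput evenly across shards.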
TBs of logs sent daily
Logs stored in Amazon S3
Amazon EMR clusters
Hive Metastore on Amazon EMR
Interactive query
Structured data in Amazon Redshift
Load predictions into Amazon Redshift
-or-
Read prediction results directly from S3
Predictions in S3
Query for predictions with Amazon ML batch API
Your application
Batch predictions
Your application
Amazon DynamoDB + AWS Lambda
Trigger event with Lambda
Query for predictions with Amazon ML real-time API
Real-time predictions
Amazon Machine Learning
Amazon Machine Learning
Easy to use, managed machine learning service built for developers
Create models using data stored in AWS
Deploy models to production in seconds
Powerful machine learning technology
Based on Amazon’s battle-hardened internal systems
Not just the algorithms:
Smart data transformations
Input data and model quality alerts
Built-in industry best practices
Grows with your needs:
Train on up to 100 GB of data
Generate billions of predictions
Obtain predictions in batches or real time
Pay-as-you-go and inexpensive
Data analysis, model training, and evaluation: $0.42/instance hour
Batch predictions: $0.10/1000
Real-time predictions: $0.10/1000 + hourly capacity reservation charge
Build & train model
Evaluate and optimize
Retrieve predictions
1 2 3
Building smart applications with Amazon ML
Create a Datasource object
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ds = ml.create_data_source_from_s3(
...     data_source_id='my_datasource',
...     data_spec={
...         'DataLocationS3': 's3://bucket/input/',
...         'DataSchemaLocationS3': 's3://bucket/input/.schema'},
...     compute_statistics=True)
Explore and understand your data
Train your model
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')
Build & train model
Evaluate and optimize
Retrieve predictions
1 2 3
Building smart applications with Amazon ML
Explore model quality
Fine-tune model interpretation
Build & train model
Evaluate and optimize
Retrieve predictions
1 2 3
Building smart applications with Amazon ML
Batch predictions
Asynchronous, large-volume prediction generation
Request through service console or API
Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> bp = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')
Real-time predictions
Synchronous, low-latency, high-throughput prediction generation
Request through service API or server or mobile SDKs
Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{'Prediction': {'predictedValue': 13.284348,
                'details': {'Algorithm': 'SGD',
                            'PredictiveModelType': 'REGRESSION'}}}
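In application code you typically want just the predicted value out of that response. A minimal helper, assuming the response structure shown above:

```python
# Pull the predicted value out of a real-time Predict response.
# The response dict mirrors the example output above.
def predicted_value(response):
    return response['Prediction']['predictedValue']

resp = {'Prediction': {'predictedValue': 13.284348,
                       'details': {'Algorithm': 'SGD',
                                   'PredictiveModelType': 'REGRESSION'}}}
print(predicted_value(resp))  # 13.284348
```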
Amazon Elastic MapReduce (EMR)
Why Amazon EMR?
Easy to use: launch a cluster in minutes
Low cost: pay an hourly rate
Elastic: easily add or remove capacity
Reliable: spend less time monitoring
Secure: manage firewalls
Flexible: control the cluster
The Hadoop ecosystem can run in Amazon EMR
Try different configurations to find your optimal architecture
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Choose your instance types
Batch process | Machine learning | Spark and interactive | Large HDFS
Easy to add/remove compute capacity to your cluster
Match compute demands with cluster sizing
Resizable clusters
Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
On-demand for core nodes: standard Amazon EC2 on-demand pricing
Easy to use Spot Instances
Meet SLA at predictable cost; exceed SLA at lower cost
Amazon S3 as your persistent data store
Separate compute and storage
Resize and shut down Amazon EMR clusters with no data loss
Point multiple Amazon EMR clusters at same data in Amazon S3
(Diagram: multiple Amazon EMR clusters reading the same data in Amazon S3)
EMRFS makes it easier to use Amazon S3
Read-after-write consistency
Very fast list operations
Error handling options
Support for Amazon S3 encryption
Transparent to applications: s3://
EMRFS client-side encryption
Amazon S3 encryption clients
EMRFS enabled for Amazon S3 client-side encryption
Key vendor (AWS KMS or your custom key vendor)
Amazon S3 (client-side encrypted objects)
HDFS is still there if you need it
Iterative workloads
• If you’re processing the same dataset more than once
Disk I/O intensive workloads
Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing
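The snippet below sketches the argument list you might pass to an S3DistCp step for that copy. The paths are hypothetical; `--src`, `--dest`, and `--groupBy` follow the S3DistCp documentation.

```python
# Build the argument list for an s3-dist-cp EMR step that copies
# data from S3 into HDFS before an I/O-intensive job. Paths are
# placeholders.
def s3distcp_args(src, dest, group_by=None):
    args = ['--src', src, '--dest', dest]
    if group_by:
        # Optionally concatenate small files matching a regex group.
        args += ['--groupBy', group_by]
    return args

print(s3distcp_args('s3://example-bucket/logs/', 'hdfs:///input/'))
```

After the job finishes, the same tool can copy results back from HDFS to S3 for durable storage.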
Amazon Redshift
Amazon Redshift Architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Execute queries in parallel
• Node types to match your workload: Dense Storage (DS2) or Dense Compute (DC1)
• Divided into multiple slices
• Local, columnar storage
10 GigE (HPC)
Ingestion / Backup / Restore
SQL Clients/BI Tools
128 GB RAM, 16 TB disk, 16 cores (per compute node)
S3 / EMR / DynamoDB / SSH
Customer VPC
Internal VPC
JDBC/ODBC
Leader Node
Compute Nodes
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
With column storage, you only read the data you need

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
analyze compression listing;
Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324   (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623     (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959     (min 637, max 959)
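The block-skipping behavior can be simulated in a few lines: each block keeps its min/max, and a query for one value only scans blocks whose range could contain it. The ranges below match the example blocks above.

```python
# Simulation of zone-map pruning: skip any block whose min/max
# range cannot contain the queried value.
blocks = [
    {'min': 10,  'max': 324, 'values': [10, 13, 14, 26, 100, 245, 324]},
    {'min': 375, 'max': 623, 'values': [375, 393, 417, 512, 549, 623]},
    {'min': 637, 'max': 959, 'values': [637, 712, 809, 834, 921, 959]},
]

def blocks_to_scan(target):
    return [i for i, b in enumerate(blocks)
            if b['min'] <= target <= b['max']]

print(blocks_to_scan(417))  # [1] -- only the middle block is read
```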
Amazon Redshift
Column storage
Data compression
Zone maps
Direct-attached storage
• Local storage for performance
• High scan rates
• Automatic replication
• Continuous backup and streaming restores to/from Amazon S3
• User snapshots on demand
• Cross-region backups for disaster recovery
Amazon Redshift online resize
Continue querying during resize
New cluster deployed in the background at no extra cost
Data copied in parallel from node to node
Automatic SQL endpoint switchover via DNS
Amazon Redshift works with existing data models: Star, Snowflake
Amazon Redshift data distribution
• Key: rows with the same distribution key are stored on the same slice
• All: a full copy of the data on every node
• Even: rows distributed round-robin across slices
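A toy model of these three styles, using a stable checksum in place of Redshift's internal hash (the slice count and keys are illustrative):

```python
import zlib

NUM_SLICES = 4

def key_slice(value):
    # KEY distribution: the same key always hashes to the same slice,
    # so co-located joins on that key avoid data shuffling.
    return zlib.crc32(str(value).encode('utf-8')) % NUM_SLICES

def even_slices(num_rows):
    # EVEN distribution: round-robin rows across slices.
    return [i % NUM_SLICES for i in range(num_rows)]

# ALL distribution would simply copy every row to every node.
print(even_slices(6))  # [0, 1, 2, 3, 0, 1]
```

Choosing KEY on a common join column, ALL for small dimension tables, and EVEN as a default is a common starting point.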
Sorting data in Amazon Redshift
In the slices (on disk), the data is sorted by a sort key
Choose a sort key that is frequently used in your queries
Data in columns is marked with a min/max value so Redshift can skip blocks not relevant to the query
A good sort key also lets queries skip entire blocks
User Defined Functions
Python 2.7
PostgreSQL UDF syntax
Network calls within UDFs are prohibited
Pandas, NumPy, and SciPy pre-installed
Import your own libraries
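For illustration, the function below is the kind of scalar logic you might register as a Python UDF. The function name and masking logic are made up; in Redshift you would wrap the body in `CREATE FUNCTION ... LANGUAGE plpythonu`.

```python
# Example scalar UDF body (hypothetical): mask an email address,
# returning NULL for NULL input as Redshift UDFs convention expects.
def f_mask_email(email):
    if email is None:
        return None
    local, _, domain = email.partition('@')
    return local[:1] + '***@' + domain

print(f_mask_email('jane.doe@example.com'))  # j***@example.com
```

Once registered, the function is callable in SQL like any built-in, e.g. `SELECT f_mask_email(email) FROM users;`.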
Interleaved multi-column sort
Currently supported: compound sort keys
• Optimized for applications that filter data by one leading column
Adding support for interleaved sort keys
• Optimized for filtering data by up to eight columns
• No storage overhead, unlike an index
• Lower maintenance penalty compared to indexes
Amazon Redshift works with your existing analysis tools
JDBC/ODBC
Amazon Redshift
Questions?
AWS Summit – Chicago: An exciting, free cloud conference designed to educate and inform new customers about the AWS platform, best practices and new cloud services.
Details
• July 1, 2015
• Chicago, Illinois
• McCormick Place
Featuring
• New product launches
• 36+ sessions, labs, and bootcamps
• Executive and partner networking
Registration is now open
• Come and see what AWS and the cloud can do for you.
• Register here: http://amzn.to/1RooPPL