Talk from the AWS Roadshow, autumn 2013
AWS Roadshow 2013
Above the Clouds: Free Your IT
Datenanalyse und Business Intelligence
Michael Hanisch, Mgr. Solutions Architecture
Matthias Jung, Solutions Architect
Constantin Gonzalez, Solutions Architect
Overview
1. Introducing Big Data
2. From data to actionable information
3. Analytics and Cloud Computing
1. Introducing Big Data
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
The cost of data generation is falling
The volume of data is increasing
Generation: lower cost, higher throughput
Collection & storage, analytics & computation, collaboration & sharing: highly constrained
[Chart: data volume, generated data vs. data available for analysis]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= Remove constraints
Generation, collection & storage, analytics & computation, collaboration & sharing: accelerated
Big Data: technologies and techniques for working productively with data, at any scale.
2. From data to actionable information
“Who buys video games?”
Per day:
3.5 billion records
13 TB of click stream logs
71 million unique cookies
Results:
500% return on ad spend
From 2 months of procurement time to a few minutes
“Who is using our service?”
Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs
In January 2013:
9,432,061 unique mobile devices used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
3. Analytics and Cloud Computing
Generation
Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase
Analytics & computation: EC2 & Elastic MapReduce
Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift
AWS Data Pipeline
Simple Storage Service (S3)
Elastic MapReduce (EMR)
What is EMR?
Hadoop-as-a-service
Map-Reduce engine
Integrated with tools
Massively parallel
Cost effective
AWS wrapper
Integrated with AWS services
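To make the "Map-Reduce engine" idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases as a word count. It is only an illustration of the programming model. The function names are hypothetical, and a real Hadoop job on EMR would distribute these phases across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because every key is reduced independently, the map and reduce calls can run in parallel on separate nodes, which is what makes the model "massively parallel".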
How does it work?
1. Put the data into S3 (or HDFS)
2. Launch your cluster. Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig, HBase)
3. Get the results
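The choices in step 2 (distribution, node count, node type, Hadoop apps) can be sketched as a plain Python function that assembles a cluster request. The dictionary keys are modeled on the keyword arguments of boto3's EMR `run_job_flow` call as I understand it. The function name, the release label, the instance type and the role names used here are placeholder assumptions, not values from the talk.

```python
from typing import Any, Dict, List


def build_cluster_request(
    name: str,
    release_label: str,
    node_type: str,
    node_count: int,
    apps: List[str],
) -> Dict[str, Any]:
    """Assemble keyword arguments shaped like an EMR run_job_flow request.

    One node acts as master; the remaining nodes form the core group.
    Nothing is sent to AWS here, the function only builds a dictionary.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    groups = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": node_type,
        "InstanceCount": 1,
    }]
    if node_count > 1:
        groups.append({
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": node_type,
            "InstanceCount": node_count - 1,
        })

    return {
        "Name": name,
        "ReleaseLabel": release_label,            # which Hadoop distribution
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep running until terminated
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",     # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_cluster_request(
        "demo", "emr-6.15.0", "m5.xlarge", 4, ["Hive", "Pig"]
    )
    print(request["Instances"]["InstanceGroups"])
```

With credentials configured, such a dictionary could presumably be passed on as `boto3.client("emr").run_job_flow(**request)`; that call is deliberately left out so the sketch runs without an AWS account.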
You can easily resize the cluster.
Use Spot nodes to save time and money.
Launch parallel clusters against the same data source (tune for the workload).
When the work is complete, you can terminate the cluster (and stop paying).
You can store everything in HDFS (local disk). High Storage nodes = 48 TB/node.
Launch in a Virtual Private Cloud for extra security.
Thousands of Customers, 5+ Million Clusters
Integrates with Hadoop Ecosystem
Give it a try: aws.amazon.com/elasticmapreduce
Cost to run a 100-node EMR cluster: EUR 6.15/hour ($8/h)
Photos:
renee_mcgurk https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/
Calgary Reviews https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/
What if all I want is a database?
Customers asked us for a data warehouse the AWS way:
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
Easy to provision and scale up massively
Amazon Redshift is a fast and powerful, petabyte-scale data warehouse that is:
A lot faster
A lot cheaper
A whole lot simpler
Amazon Redshift Dramatically Reduces IO
Column storage
Data compression
Zone maps
Direct-attached storage
Large data block sizes
Id    Age   State
123   20    CA
345   25    WA
678   40    FL
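How column storage and zone maps cut IO can be illustrated with the three example rows from the slide. The sketch below is a toy model, not Redshift's actual storage engine: it stores each column separately and keeps the minimum and maximum value per block, so a scan can skip any block whose range cannot match. All function names and the block size are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

# The example rows from the slide: (Id, Age, State).
ROWS = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]


def to_columns(rows: Sequence[tuple], names: Sequence[str]) -> Dict[str, list]:
    """Column storage: keep all values of one column together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def build_zone_map(values: List[int], block_size: int) -> List[Tuple[int, int, List[int]]]:
    """Split a column into blocks and record (min, max, block) for each."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_greater_than(zone_map, threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for _low, high, block in zone_map:
        if high <= threshold:
            continue  # no value in this block can exceed the threshold
        blocks_read += 1
        matches.extend(value for value in block if value > threshold)
    return matches, blocks_read


if __name__ == "__main__":
    columns = to_columns(ROWS, ("Id", "Age", "State"))
    age_zone_map = build_zone_map(columns["Age"], block_size=2)
    print(scan_greater_than(age_zone_map, 30))
```

A query such as "Age > 30" touches only the Age column, and within it only the blocks whose maximum exceeds 30; the Id and State columns are never read.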
Amazon Redshift parallelizes and distributes everything
Query
Load
Backup
Restore
Resize
Amazon Redshift Runs on Optimized Hardware
HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
Optimized for I/O intensive workloads
High disk density
Runs in HPC - fast network
HS1.8XL available on Amazon EC2
Redshift lets you start small and grow big
Extra Large Node (XL): 3 spindles, 2 TB, 15 GiB RAM, 2 virtual cores, 10GigE
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB to 64 TB)
8 Extra Large Node (8XL): 24 spindles, 16 TB, 120 GiB RAM, 16 virtual cores, 10GigE
Cluster 2-100 Nodes (32 TB to 1.6 PB)
Priced to Analyze All the Customer’s Data
                      Price per hour for    Effective hourly   Effective annual
                      HS1.XL single node    price per TB       price per TB
On-Demand             $0.850                $0.425             $3,723
1 Year Reservation    $0.500                $0.250             $2,190
3 Year Reservation    $0.228                $0.114             $999
Simple Pricing: Number of Nodes x Cost per Hour
No charge for Leader Node
Pay as you grow
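The effective per-TB figures follow from the node price, assuming 2 TB of storage per HS1.XL node (as listed in the hardware specs) and 8,760 hours in a year. The short sketch below reproduces that arithmetic and the "number of nodes x cost per hour" rule. The function and constant names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours
TB_PER_HS1_XL_NODE = 2      # storage per HS1.XL node, from the hardware slide

# Hourly HS1.XL node prices in USD, taken from the pricing table.
HOURLY_NODE_PRICE_USD = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_price_per_tb(node_price_per_hour: float) -> tuple:
    """Return (hourly price per TB, annual price per TB) for one node price."""
    per_tb_hour = node_price_per_hour / TB_PER_HS1_XL_NODE
    return per_tb_hour, per_tb_hour * HOURS_PER_YEAR


def cluster_cost_per_hour(nodes: int, node_price_per_hour: float) -> float:
    """Simple pricing: number of nodes times cost per hour."""
    return nodes * node_price_per_hour


if __name__ == "__main__":
    for plan, price in HOURLY_NODE_PRICE_USD.items():
        hourly, annual = effective_price_per_tb(price)
        print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

Rounded to whole dollars, the annual values match the table: 0.425 x 8,760 = 3,723, 0.250 x 8,760 = 2,190, and 0.114 x 8,760 = 998.64, which rounds to 999.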
Amazon Redshift Simplifies Provisioning
• Create a cluster in minutes
• Automatically patch your OS and data warehouse software
• Scale up to 1.6PB with a few clicks and no downtime
Amazon Redshift Simplifies Operations
• Built-in security in transit, at rest, when backed up*
• Backup to S3 is continuous, incremental, and automatic
• Disk failures are transparent; nodes recover automatically
• Streaming restore lets you resume querying faster
*SSL, Amazon VPC, AES-256 (hardware accelerated)
Initial Pilot Results
Current production environment: 32 nodes, 128 CPUs, 4.2 TB RAM, 1.6 PB disk
Tested 2B row data set, 6 representative queries on a 2-node Amazon Redshift cluster
Queries ran more than 10x faster
Amazon Redshift Integrates With All Data Sources
Amazon DynamoDB
Amazon Elastic MapReduce
Amazon Simple Storage Service (S3)
Amazon EC2
AWS Storage Gateway Service
Corporate Data Center
Amazon Relational Database Service (RDS)
Amazon Redshift Integrates With Existing BI Tools
Connect your tools to Amazon Redshift over JDBC/ODBC using standard drivers from PostgreSQL.org
Data Integration Partners*
On-Premises Integration
RDBMS
Redshift
OLTP, ERP
Reporting and BI
Cloud ETL for Big Data
• Maintain online SQL access to your historical data
• Transformation and enrichment with EMR
• Longer history ensures better insight
Redshift, Elastic MapReduce, S3
Reporting and BI
Thank you! glez@amazon.de
Learn more: aws.amazon.com/big-data
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage temporary compute resources
Anatomy of a pipeline
Additional checks and notifications
Arbitrarily complex pipelines
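The ideas of mapping data dependencies and retrying failed steps can be sketched in a few lines of plain Python. This is a conceptual toy, not the AWS Data Pipeline API: the function name, the task names and the retry limit are all assumptions. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later) to order tasks so that each one runs only after the tasks it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set, Tuple


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
    max_attempts: int = 3,
) -> Tuple[List[str], Dict[str, int]]:
    """Run tasks in dependency order, retrying each failing task.

    tasks maps a task name to a zero-argument callable.
    dependencies maps a task name to the names that must finish first.
    Returns the execution order and the attempts used per task.
    """
    # Include every task, even those without declared dependencies.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())

    attempts: Dict[str, int] = {}
    for name in order:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                tasks[name]()
                break  # task succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
    return order, attempts


if __name__ == "__main__":
    steps = {
        "extract": lambda: print("copy logs"),
        "transform": lambda: print("aggregate"),
        "load": lambda: print("load warehouse"),
    }
    deps = {"transform": {"extract"}, "load": {"transform"}}
    print(run_pipeline(steps, deps))
```

A managed service adds scheduling, notifications and temporary compute resources on top of this basic ordering and retry loop.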