AWS Roadshow Autumn 2013: Data Analysis and Business Intelligence

Talk from the AWS Roadshow Autumn 2013

AWS Roadshow 2013: Above the clouds, free your IT

Data Analysis and Business Intelligence

Michael Hanisch, Mgr. Solutions Architecture

Matthias Jung, Solutions Architect

Constantin Gonzalez, Solutions Architect

Overview

1. Introducing Big Data

2. From data to actionable information

3. Analytics and Cloud Computing

1. Introducing Big Data

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

The cost of data generation is falling

The volume of data is increasing

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Highly constrained

[Chart: data volume, generated data vs. data available for analysis]

Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Elastic and highly scalable

+ No upfront capital expense

+ Only pay for what you use

+ Available on-demand

= Remove constraints

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Accelerated

Big Data: technologies and techniques for working productively with data, at any scale.

2. From data to actionable information

“Who buys video games?”

Per day:

3.5 billion records

13 TB of click stream logs

71 million unique cookies

Results:

500% return on ad spend

From 2 months procurement time to a few minutes

“Who is using our service?”

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app.

4 million+ calls. 5 million+ directions.

3. Analytics and Cloud Computing

Generation

Collection & storage: S3, Glacier, Storage Gateway, DynamoDB, Redshift, RDS, HBase

Analytics & computation: EC2 & Elastic MapReduce

Collaboration & sharing: EC2 & S3, CloudFormation, Elastic MapReduce, RDS, DynamoDB, Redshift

AWS Data Pipeline

Simple Storage Service (S3)

Elastic MapReduce (EMR)

What is EMR?

• Map-Reduce engine

• Integrated with tools

• Hadoop-as-a-service

• Massively parallel

• Cost-effective AWS wrapper

• Integrated with AWS services
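
To make the "Map-Reduce engine" point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is only an illustration of the programming model, not how EMR or Hadoop is implemented; the click records and the function names are hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical click-stream records: (cookie_id, page). Not from the talk.
CLICKS = [
    ("cookie-a", "/games"),
    ("cookie-b", "/games"),
    ("cookie-a", "/checkout"),
    ("cookie-c", "/home"),
    ("cookie-b", "/games"),
]


def map_click(record):
    """Map phase: emit one (page, 1) pair per click."""
    _cookie, page = record
    yield page, 1


def reduce_counts(page, counts):
    """Reduce phase: sum all counts emitted for one page."""
    return page, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


page_views = run_map_reduce(CLICKS, map_click, reduce_counts)
print(page_views)
```

On a real cluster the map and reduce calls would run in parallel on many nodes, and the shuffle would move data between them over the network.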

How does it work?

1. Put the data into S3 (or HDFS)

2. Launch your cluster. Choose:
• Hadoop distribution
• How many nodes
• Node type (hi-CPU, hi-memory, etc.)
• Hadoop apps (Hive, Pig, HBase)

3. Get the results
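
The choices in step 2 can be sketched as a request description. The function below only builds a dictionary; it does not call AWS. The keys follow the parameter names of today's boto3 EMR `run_job_flow` call, which postdates this talk, so treat that mapping as an assumption. The helper name, the release label, and the instance type are example values, not recommendations from the slides.

```python
def build_cluster_request(name, node_type, node_count, apps,
                          release_label="emr-6.15.0"):
    """Build a dict shaped like the keyword arguments of boto3's
    EMR client method run_job_flow. Nothing is sent to AWS here."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")

    # One master node, the remaining nodes form the core group.
    groups = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "InstanceType": node_type,   # node type (hi-CPU, hi-memory, ...)
        "InstanceCount": 1,
    }]
    if node_count > 1:
        groups.append({
            "Name": "core",
            "InstanceRole": "CORE",
            "InstanceType": node_type,
            "InstanceCount": node_count - 1,
        })

    return {
        "Name": name,
        "ReleaseLabel": release_label,                      # Hadoop distribution
        "Applications": [{"Name": app} for app in apps],    # Hadoop apps
        "Instances": {
            "InstanceGroups": groups,                       # how many nodes
            # False: the cluster terminates once its steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("clickstream-demo", "m5.xlarge", 4,
                                ["Hive", "Pig", "HBase"])
print(request["Applications"])
```

With credentials configured, such a dict could be passed as `boto3.client("emr").run_job_flow(**request)`; that call is deliberately left out so the sketch runs offline.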

You can easily resize the cluster

Use Spot nodes to save time and money

Launch parallel clusters against the same data source (tune for the workload)

When the work is complete, you can terminate the cluster (and stop paying)

You can store everything in HDFS (local disk)

High Storage nodes = 48 TB/node

Launch in a Virtual Private Cloud for extra security

Thousands of Customers, 5+ Million Clusters

Integrates with Hadoop Ecosystem

Give it a try: aws.amazon.com/elasticmapreduce

Cost to run a 100-node EMR cluster: EUR 6.15/hour ($8/h)
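
The cluster price quoted above can be broken down per node. This is a back-of-the-envelope sketch that assumes the price scales linearly with node count and running time; the slide does not say which node type or pricing model the figure refers to, and `job_cost_usd` is a hypothetical helper.

```python
CLUSTER_NODES = 100
CLUSTER_USD_PER_HOUR = 8.00   # from the slide
CLUSTER_EUR_PER_HOUR = 6.15   # from the slide

# Price of a single node for one hour.
usd_per_node_hour = CLUSTER_USD_PER_HOUR / CLUSTER_NODES

# Exchange rate implied by the two figures on the slide.
implied_usd_per_eur = CLUSTER_USD_PER_HOUR / CLUSTER_EUR_PER_HOUR


def job_cost_usd(nodes, hours):
    """Cost of running `nodes` nodes for `hours` hours (linear assumption)."""
    return nodes * hours * usd_per_node_hour


print(f"{usd_per_node_hour:.2f} USD per node-hour")
print(f"{implied_usd_per_eur:.2f} USD per EUR (implied)")
print(f"{job_cost_usd(100, 3):.2f} USD for 100 nodes x 3 hours")
```

Because the cluster can be terminated when the work is complete, the cost of a job is bounded by its running time rather than by a fixed hardware purchase.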

Photos: renee_mcgurk, https://www.flickr.com/photos/51018933@N08/5355664961/in/photostream/; Calgary Reviews, https://www.flickr.com/photos/calgaryreviews/6328302248/in/photostream/

What if all I want is a database?

Customers asked us for a data warehouse the AWS way:

• No upfront costs, pay as you go

• Really fast performance at a really low price

• Open and flexible with support for popular tools

• Easy to provision and scale up massively

Amazon Redshift is a fast and powerful, petabyte-scale data warehouse that is:

• A Lot Faster

• A Lot Cheaper

• A Whole Lot Simpler

Amazon Redshift Dramatically Reduces IO

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

Id  | Age | State
123 | 20  | CA
345 | 25  | WA
678 | 40  | FL
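
Two of the techniques listed above, column storage and zone maps, can be illustrated with the three-row table from the slide. This is a toy model under stated assumptions, not Redshift's actual storage format: the block size of 2 and all function names are hypothetical.

```python
# The example table from the slide, stored row by row.
ROWS = [
    {"id": 123, "age": 20, "state": "CA"},
    {"id": 345, "age": 25, "state": "WA"},
    {"id": 678, "age": 40, "state": "FL"},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def build_zone_map(values, block_size):
    """Split a column into blocks and record (min, max, block) per block."""
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    return [(min(block), max(block), block) for block in blocks]


def scan_greater_than(zone_map, threshold):
    """Return matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for _low, high, block in zone_map:
        if high <= threshold:
            continue  # zone map proves no value matches: skip the block
        blocks_read += 1
        hits.extend(value for value in block if value > threshold)
    return hits, blocks_read


columns = to_columns(ROWS)
# A query on "age" touches only the age column, not id or state.
age_zone_map = build_zone_map(columns["age"], block_size=2)
hits, blocks_read = scan_greater_than(age_zone_map, 30)
print(hits, blocks_read)
```

The query `age > 30` reads one of the two age blocks and none of the other columns, which is the sense in which these techniques reduce IO.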

Amazon Redshift parallelizes and distributes everything

Query

Load

Backup

Restore

Resize

Amazon Redshift Runs on Optimized Hardware

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate

HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage

Optimized for I/O intensive workloads

High disk density

Runs in HPC - fast network

HS1.8XL available on Amazon EC2

Redshift lets you start small and grow big

Extra Large Node (XL): 3 spindles, 2TB, 15GiB RAM, 2 virtual cores, 10GigE

• Single Node (2TB)

• Cluster 2-32 Nodes (4TB – 64TB)

8 Extra Large Node (8XL): 24 spindles, 16TB, 120GiB RAM, 16 virtual cores, 10GigE

• Cluster 2-100 Nodes (32TB – 1.6PB)
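
The capacity ranges above follow directly from node count times storage per node. The sketch below recomputes them from the slide's figures; the dictionary layout and the function name are illustrative only.

```python
# Per-node storage and cluster size limits as stated on the slide.
NODE_TYPES = {
    "XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, nodes):
    """Raw storage of a multi-node cluster in TB."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= nodes <= spec["max_nodes"]:
        raise ValueError(f"{node_type} clusters have "
                         f"{spec['min_nodes']}-{spec['max_nodes']} nodes")
    return nodes * spec["tb_per_node"]


for node_type, spec in NODE_TYPES.items():
    low = cluster_capacity_tb(node_type, spec["min_nodes"])
    high = cluster_capacity_tb(node_type, spec["max_nodes"])
    print(f"{node_type}: {low} TB to {high} TB")
```

The 8XL upper bound of 1,600 TB is the 1.6PB quoted on the slide. The single-node XL configuration (2TB) is a separate option and is not modelled here.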

Priced to Analyze All the Customer’s Data

Plan | Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB
On-Demand | $ 0.850 | $ 0.425 | $ 3,723
1 Year Reservation | $ 0.500 | $ 0.250 | $ 2,190
3 Year Reservation | $ 0.228 | $ 0.114 | $ 999

Simple Pricing: Number of Nodes x Cost per Hour

No charge for Leader Node

Pay as you grow
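
The per-TB columns of the price table can be reproduced from the hourly node price, assuming 2 TB per HS1.XL node (as stated on the hardware slide) and 8,760 hours per year. The function names are hypothetical; the prices are the 2013 figures from the slide.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
TB_PER_XL_NODE = 2          # HS1.XL storage, from the hardware slide

# Hourly price of one HS1.XL node in USD, from the slide.
HOURLY_PRICE_USD = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_price):
    """Return (price per TB per hour, price per TB per year)."""
    per_tb_hour = hourly_price / TB_PER_XL_NODE
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


def cluster_cost_per_hour(compute_nodes, hourly_price):
    """Simple pricing: number of nodes x cost per hour.
    The leader node is not charged, so only compute nodes count."""
    return compute_nodes * hourly_price


for plan, price in HOURLY_PRICE_USD.items():
    per_tb_hour, per_tb_year = effective_prices(price)
    print(f"{plan}: ${per_tb_hour:.3f}/TB/hour, ${per_tb_year:,.0f}/TB/year")
```

Rounded to whole dollars, the annual figures match the table: 3,723, 2,190, and 999.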

Amazon Redshift Simplifies Provisioning

• Create a cluster in minutes

• Automatically patch your OS and data warehouse software

• Scale up to 1.6PB with a few clicks and no downtime

Amazon Redshift Simplifies Operations

• Built-in security in transit, at rest, when backed up*

• Backup to S3 is continuous, incremental, and automatic

• Disk failures are transparent; nodes recover automatically

• Streaming restore lets you resume querying faster

*SSL, Amazon VPC, AES-256 (Hardware Accelerated)

[Diagram: Clients, Amazon Redshift, Amazon S3; labels: (Optional) SSL; Continuous, Automatic Backup; Streaming Restore]

Initial Pilot Results

Current production environment: 32 nodes, 128 CPUs, 4.2TB RAM, 1.6 PB disk

Tested a 2B row data set with 6 representative queries on a 2-node Amazon Redshift cluster

Queries ran > 10x faster

Amazon Redshift Integrates With All Data Sources

Amazon DynamoDB

Amazon Elastic MapReduce

Amazon Simple Storage Service (S3)

Amazon EC2

AWS Storage Gateway Service

Corporate Data Center

Amazon Relational Database Service (RDS)

Amazon Redshift Integrates With Existing BI Tools

Connect your tools to Amazon Redshift using standard drivers from PostgreSQL.org

Amazon Redshift

JDBC/ODBC
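
Because Redshift accepts standard PostgreSQL drivers, a client connection looks like a PostgreSQL connection. The sketch below uses the third-party `psycopg2` driver as one possible choice; the slide does not name a specific driver. The endpoint, database name, and credentials are hypothetical placeholders, and 5439 is assumed as the port because it is Redshift's default.

```python
def redshift_connection_params(host, dbname, user, password, port=5439):
    """Build keyword arguments for a PostgreSQL driver connection.
    Port 5439 is Redshift's default; sslmode asks the driver to use SSL."""
    return {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "password": password,
        "sslmode": "require",
    }


def run_query(params, sql):
    """Run one SQL statement and return all rows.
    psycopg2 is imported lazily so the rest of the sketch runs without it."""
    import psycopg2  # third-party PostgreSQL driver

    conn = psycopg2.connect(**params)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()


# Hypothetical endpoint and credentials, for illustration only.
params = redshift_connection_params(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    dbname="analytics",
    user="report_user",
    password="change-me",
)
print(params["host"], params["port"])
```

With a reachable cluster, `run_query(params, "SELECT 1")` would return the result rows. BI tools do the equivalent through their JDBC or ODBC configuration.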

Data Integration Partners*

On-Premises Integration

RDBMS

Redshift

OLTP, ERP

Reporting and BI

Cloud ETL for Big Data

• Maintain online SQL access to your historical data

• Transformation and enrichment with EMR

• Longer history ensures better insight

Redshift, Elastic MapReduce, S3

Reporting and BI

Thank you! glez@amazon.de

Learn More: aws.amazon.com/big-data

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage temporary compute resources
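
The ideas of mapping data dependencies and applying execution and retry logic can be sketched in a few lines. This is an in-process illustration only, not the AWS Data Pipeline API: the task names, the `run_pipeline` helper, and the simulated transient failure are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on, max_attempts=3):
    """Run tasks in dependency order and retry a failing task.
    Returns the execution order and the attempts each task needed."""
    order = list(TopologicalSorter(depends_on).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                tasks[name]()
                break                      # task succeeded
            except Exception:
                if attempt == max_attempts:
                    raise                  # give up after the last attempt
    return order, attempts


calls = {"transform": 0}


def flaky_transform():
    """Fails on the first call to simulate a transient error."""
    calls["transform"] += 1
    if calls["transform"] < 2:
        raise RuntimeError("transient failure")


TASKS = {
    "copy_logs_to_s3": lambda: None,
    "transform_on_emr": flaky_transform,
    "load_into_redshift": lambda: None,
}

# Each task maps to the set of tasks it depends on.
DEPENDS_ON = {
    "transform_on_emr": {"copy_logs_to_s3"},
    "load_into_redshift": {"transform_on_emr"},
}

order, attempts = run_pipeline(TASKS, DEPENDS_ON)
print(order)
print(attempts)
```

The managed service adds scheduling, temporary compute resources, and notifications on top of this basic ordering and retry behaviour.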

Anatomy of a pipeline

Additional checks and notifications

Arbitrarily complex pipelines
