(BDT201) Big Data and HPC State of the Union | AWS re:Invent 2014


DESCRIPTION

Leveraging big data and high performance computing (HPC) solutions enables your organization to make smarter and faster decisions that influence strategy, increase productivity, and ultimately grow your business. We kick off the Big Data and HPC track with the latest advancements in data analytics, databases, storage, and HPC at AWS. Hear customer success stories and discover how to put data to work in your own organization.


November 12, 2014 | Las Vegas, NV

Ben Butler, Sr. Solutions Marketing Mgr., Big Data and HPC

Agenda:

• Big data on AWS
• Big data customer success stories
• HPC on AWS
• HPC Customer Presentation: Honda
• AWS resources to get started
• Big data and HPC track review: where to go next


The big data pipeline: generation → collection and storage → analytics and computation → collaboration and sharing.

Data generation sources:

• IT/application server logs
• Websites/mobile apps/ads
• Sensor data/IoT
• Social media, user content

Data volumes are growing from GB and TB through PB and EB toward ZB. Collection and storage used to be highly constrained; in the cloud, this stage of the pipeline comes at lower cost and higher throughput.

What is Big Data?

Technologies and techniques for working productively with data, at any scale: collect, store, organize, analyze, and share it. Big data means accelerated generation, collection and storage, analytics and computation, and collaboration and sharing.

Big data on the AWS cloud:

• Collect: Amazon Kinesis, AWS Import/Export, AWS Direct Connect
• Store: Amazon S3, Amazon Glacier, Amazon DynamoDB
• Analyze: Amazon Elastic MapReduce, Amazon EC2, Amazon Kinesis, Amazon Redshift
• Share: Amazon S3, Amazon Redshift
• Orchestrate: AWS Data Pipeline

Amazon Kinesis

• Real-time processing
• High throughput; elastic
• Easy to use
• Amazon EMR, S3, Redshift, DynamoDB integrations

Amazon S3

• Store anything
• Object storage
• Scalable
• 99.999999999% durability

Amazon DynamoDB

• NoSQL database
• Seamless scalability
• Zero admin
• Single-digit millisecond latency

Amazon Redshift

• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed
• $1,000/TB/year

Amazon Elastic MapReduce

• Hadoop/HDFS clusters
• Hive, Pig, Impala, HBase
• Easy to use; fully managed
• Scale to thousands of nodes

Extending your corporate data center with an elastic data center on AWS:

1. Application data and logs for analysis are pushed to Amazon S3.
2. An Amazon Elastic MapReduce name node is started to control the analysis.
3. A Hadoop cluster is started by Elastic MapReduce.
4. Hundreds to thousands of nodes are added as needed.
5. The cluster is disposed of when the job completes.
6. Results of the analysis are pulled back into your systems.

Databricks sets new large-scale sort record with AWS

• Databricks, founders of Apache Spark
• Why AWS? Amazon EC2: fast access to large compute, SSD, 10 Gb/s network
• Agility

Big data customers across industries: mobile/cable, telecom, oil and gas, industrial manufacturing, retail/consumer, entertainment, hospitality, life sciences, scientific exploration, financial services, publishing, media, advertising, online media, social networks, gaming.

Sling Uses AWS to Store and Analyze Terabytes of Data

"By using AWS, we can make decisions about new features and offers very quickly and very easily."
Dmitry Dimov, Director, Online Services, Sling Media

• Needed to leverage terabytes of usage data to generate user insights and innovate to capture market share
• Using AWS made it possible for Sling to offer a value-add product to its partners
• Stored terabytes of analytics data
• Enabled near real-time ad hoc analytics
• Capacity to scale the database immediately

"By using Amazon Redshift, we can process petabytes of data from thousands of marketing campaigns simultaneously while reducing operating expenses by 75%."
Zhong Hong, VP, Infrastructure and Operations, VivaKi

NDN Uses AWS to Serve 600 Million Videos to Worldwide Users

"Using AWS has enabled us to build a solid platform that has scaled quickly while becoming a source of profit for our customers."
Eric Orme, COO and CTO, NDN

• NDN, a global media exchange for publishers and content creators, enables 146 million users a month to see videos online
• Ingested and stored more than 100,000 video titles per month and served 600 million content plays a month
• Uses Amazon Kinesis to analyze over a billion user-generated events and page loads per day

Financial Times Uses AWS to Reduce Infrastructure Costs by 80%

"When our analysts first started to do queries on Amazon Redshift, they thought it was broken because it was working so fast."
John O'Donovan, CTO, Financial Times

• Needed a way to increase the speed, performance, and flexibility of data analysis at a low cost
• Using AWS enabled the FT to run queries 98% faster than before, helping it make business decisions quickly
• Easier to track and analyze trends
• Reduced infrastructure costs by 80% compared with a traditional data center model

NTT DOCOMO Delivers Voice Recognition Services to Over 60 Million Customers by Using AWS

"I cannot imagine NTT DOCOMO without the AWS Cloud."
Minoru Etoh, Senior VP, NTT DOCOMO

• NTT DOCOMO, Inc. is the predominant mobile phone operator in Japan
• DOCOMO launched a popular voice recognition service and experienced large traffic spikes in its mobile network that impacted performance
• DOCOMO decided to migrate its whole environment to AWS last June
• The company built a voice recognition architecture able to scale easily to handle spikes in traffic and serve over 60 million customers

Kellogg Uses AWS to Save $900K Over 5 Years Compared with On-premises Infrastructure

"Using AWS saves us $900,000 in infrastructure costs alone, and lets us run dozens of simulations a day so we can reduce trade spend. It's a win-win."
Stover McIlwain, Senior Director, IT Infrastructure Engineering

• Needed a better way to track and model promotional costs ("trade spend") to improve the bottom line, and to run more than one trade-spend simulation per day
• By using SAP HANA on AWS, Kellogg estimates it will save $900,000 over 5 years versus traditional on-premises infrastructure alternatives
• The company can also run dozens of trade-spend simulations each day and has cut deployment time by 30x

Baylor College of Medicine Uses AWS to Accelerate Analysis and Discovery

"We are able to power ultra-large-scale clinical studies that require computational infrastructure in a secure and compliant environment at a scale not previously possible."
Omar Serang, Chief Cloud Officer, DNAnexus

• Stores more than 430 TB of genomic result data
• Analyzes the genome sequences of more than 14,000 individuals, 5 times faster than with the previous infrastructure
• Enables more than 200 scientists worldwide to share tools and data quickly

HG Data Uses AWS to Process Billions of Documents for BI Monthly

"We used Amazon EMR to make running Hadoop clusters easy, and now we can de-dupe 10+ billion documents."
Victor Moreira, CTO, HG Data

Architecture: a Java document crawler on EC2 pulls documents from the Internet; a packaging step on EC2 lands them in Amazon S3 and a MongoDB cluster on EC2; Hadoop handles ETL and analytics alongside Java/Python analytics; results feed an Elasticsearch cluster on EC2 and MySQL on RDS, which back the HG API and HG WebApp serving direct clients, enterprise partners, and end users.

Take a typical big computation task that an average cluster is too small for (or that simply takes too long to complete). Optimizing the algorithms can give some leverage and complete the task in hand. Applying a very large cluster can sometimes be overkill and too expensive. AWS instance clusters can instead be balanced to the job in hand: not too large, nor too small, with multiple clusters running at the same time.

Why AWS for HPC?

• Low cost with flexible pricing
• Efficient clusters
• Unlimited infrastructure
• Faster time to results
• Concurrent clusters on demand
• Increased collaboration

Popular HPC workloads on AWS: transcoding and encoding, Monte Carlo simulations, computational chemistry, government and educational research, modeling and simulation, genome processing.

Scalability on AWS: scale using elastic capacity.

• Time +00h: <10 cores
• Time +24h: >1,500 cores
• Time +72h: <10 cores
• Time +120h: >600 cores

Schrodinger and Cycle Computing: computational chemistry

Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used in photovoltaic cells for solar panel material. An estimated 264 years of computation completed in 18 hours.

• 156,314-core cluster across 8 regions
• 1.21 petaflops (Rpeak)
• $33,000 total, or 16¢ per molecule
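Those headline numbers are easy to sanity-check with simple division:

```python
# Per-molecule cost: total spend over the number of compounds screened.
cost_per_molecule = 33_000 / 205_000
print(round(cost_per_molecule, 2))    # 0.16 -> the quoted 16 cents

# Effective speedup: 264 years of serial compute finished in 18 hours.
serial_hours = 264 * 365.25 * 24
speedup = serial_hours / 18           # roughly 129,000x, i.e. about 82%
                                      # parallel efficiency on 156,314 cores
print(round(speedup))
```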

Cost Benefits of HPC in the Cloud

• Cloud: pay-as-you-go model; use only what you need; multiple pricing models
• On-premises: capital expense model; high upfront capital cost; high cost of ongoing support

Many pricing models to support different workloads:

• Free Tier: get started on AWS with free usage and no commitment. For POCs and getting started.
• On-Demand: pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
• Reserved: make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
• Spot: bid for unused capacity, charged at a Spot price that fluctuates based on supply and demand. For time-insensitive or transient workloads.
• Dedicated: launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
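Choosing between the models is mostly arithmetic. The sketch below compares a year of steady 24/7 usage under hypothetical prices (real rates vary by instance type and region):

```python
HOURS_PER_YEAR = 8766  # 365.25 days

def yearly_cost(hourly_rate: float, upfront: float = 0.0) -> float:
    """Total cost of one instance running all year."""
    return upfront + hourly_rate * HOURS_PER_YEAR

on_demand = yearly_cost(0.50)                  # hypothetical $0.50/hour
reserved = yearly_cost(0.20, upfront=1_000)    # hypothetical discounted rate
spot = yearly_cost(0.15)                       # hypothetical average Spot price

# For committed, steady utilization the reserved model beats on-demand;
# Spot can be cheaper still, at the cost of possible interruption.
assert spot < reserved < on_demand
```

For spiky workloads the comparison flips: with few utilized hours, the upfront reserved payment never pays for itself and on-demand wins.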

When to consider running HPC workloads on AWS

New ideas:
• New HPC project
• Proof of concept
• New application features
• Training models
• Benchmarking algorithms

Improvement:
• Remove the queue
• Hardware refresh cycle
• Reduce costs
• Collaboration on results
• Increase innovation speed
• Reduce time to results

Transforming drive design to store the world's data: the world's largest F500 cloud run

• Workload: new drive head designs; submit jobs and orchestrate HPC clusters over VPC with EBS; encrypt and route data to AWS, return results
• Ran 1 million drive head designs = 70.75 core-years
• 90x throughput: ran in 8 hours, not 30 days
• 3 days from idea to running
• Cluster of 70,908 cores with Spot Instances, 729 TFLOPS
• c3 and r3 instances with Intel E5-2670 v2
• Cost: $5,594
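The 90x throughput figure checks out exactly: 30 days of wall-clock time compressed into 8 hours.

```python
baseline_hours = 30 * 24   # 30 days on a fixed-size in-house cluster
cloud_hours = 8            # the same million-design sweep on AWS
print(baseline_hours / cloud_hours)   # 90.0
```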

HPC Customer Presentation: Honda

Honda's products span motorcycles, automobiles, and power products, along with ASIMO, HondaJet, UNI-CUB, MC-β, FCX, and the Honda Smart Home System (HSHS).

"Dreams are the source of our courage and energy to meet every challenge without fear of failure."

Honda operates worldwide (figures as of March 31, 2014, and for April 2013 to March 2014) across Japan, North America, South America, Europe, China, and Asia/Oceania. Each R&D site (motorcycles, automobiles, power products, aircraft engines, fundamental research, and others) had its own individual HPC resources. Honda consolidated these HPC resources into the Honda data center.

Goals: overall optimization and globalization.

Usage patterns: use for a certain period; parallel transient clusters; trial use; need for a lot of cores; high memory.

Why the cloud?
• Lead time: no complicated procedures and screening; no need to worry about the availability of resources
• Agility: use the AWS API to start EC2 instances quickly; stop anytime with pay-as-you-go
• Service: choose from several EC2 instance types (including the new types); EC2 Spot Instances

Architecture: a cluster manager launches computing nodes (Spot or On-Demand) with data attached. [Chart: instance count over time.]

The improvement cycle: anyway, use the cloud; accumulate knowledge and plan the next step; suggest improvements to providers; providers release new services.


AWS resources to get started: Solution Architects, Professional Services, Premium Support, and the AWS Partner Network (APN).

AWS Architectures

Reference architecture diagrams

aws.amazon.com/architecture


Big Data Case Studies

Learn from other AWS customers

aws.amazon.com/solutions/case-studies/big-data

AWS Partner Network – Big Data Competency

Partner with an AWS Big Data expert

AWS Marketplace

AWS Online Software Store

aws.amazon.com/marketplace

Shop the big data and HPC categories

AWS Public Data Sets

Free access to big data sets

aws.amazon.com/publicdatasets

AWS Big Data and HPC Test Drives

APN Partner-provided labs

aws.amazon.com/testdrive

Learn from AWS big data experts: learn how to use Apache Storm and Amazon Kinesis to process streaming real-time data.

blogs.aws.amazon.com/bigdata

aws.amazon.com/training

• Big Data Technology Fundamentals: online training
• Big Data on AWS: instructor-led training

Visit the Big Data Kiosk at the AWS Booth in the Expo Room

http://bit.ly/awsevals

Learn more about Big Data and HPC on AWS:

aws.amazon.com/big-data

aws.amazon.com/hpc

Thank you!