DESCRIPTION
Managing big data and running supercomputing jobs used to be only for well-funded research organizations and large corporations, but not any longer. AWS has democratized supercomputing and big data for the masses! AWS can provide you with the 64th fastest supercomputer in the world, on demand and pay as you go. Hear from Ben Butler, Head of AWS Big Data Marketing, to learn how our customers are using big data and high performance computing to change the world. Not only is AWS technology available to everyone, but it is self-service and cheaper than ever before; with innovative technology and flexible pricing models, the AWS cloud computing platform has disrupted big data and HPC. Learn from customer successes as Ben shares real-world case studies describing the specific big data and high performance computing challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the tools needed to get you started quickly.
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Big Data and High Performance Computing
Solutions in the AWS Cloud
Ben Butler, Sr. Mgr. Big Data & HPC Marketing
@bensbutler March 26, 2014
Tell us: What’s good, what’s not
What you want to see at these events
What you want AWS to deliver for you
Your feedback is very important to us
What we’ll cover today: Big Data, HPC, Customer Success Story, Getting Started on AWS
Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Big Data: unconstrained data growth (GB → TB → PB → EB → ZB)
95% of the 1.2 zettabytes of data in the digital universe is unstructured.
70% of this is user-generated content.
Unstructured data growth is explosive, with compound annual growth (CAGR) estimated at 62% from 2008–2012. (Source: IDC)
Lower cost, higher throughput: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real time bidding
Clickstream analysis
Fraud detection
Use Cases
Visits, views, clicks, purchases
Source, device, location, time
Latency, throughput, uptime
Likes, shares, friends, follows
Price, frequency
Metrics
Relational
NoSQL
Web servers
Mobile phones
Tablets
3rd party feeds
Sources
Structured
Unstructured
Text
Binary
Near Real-time
Batched
Formats
Reporting
Dashboards
Sentiment
Clustering
Machine Learning
Optimization
Analysis
Lower cost, higher throughput at collection; analytics & computation highly constrained: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
[Chart: data volume over time, generated data vs. data available for analysis]
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Elastic and highly scalable + no upfront capital expense + only pay for what you use + available on-demand = remove constraints
Accelerated: Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Big Data: technologies and techniques for working productively with data, at any scale.
Big data and AWS cloud computing:
Big data: variety, volume, and velocity requiring new tools. Cloud computing: a variety of compute, storage, and networking options.
Big data: potentially massive datasets. Cloud computing: massive, virtually unlimited capacity.
Big data: an iterative, experimental style of data manipulation and analysis. Cloud computing: an iterative, experimental style of infrastructure deployment/usage.
Big data: frequently not a steady-state workload; peaks and valleys. Cloud computing: at its most efficient with highly variable workloads.
Big data: absolute performance not as critical as “time to results”; shared resources are a bottleneck. Cloud computing: parallel compute projects give each workgroup more autonomy and faster results.
Lower costs: no capital investment; pay as you go; no subscriptions; only pay for what you use.
Ease of use: programmable; zero admin; easy to configure; integrates with existing tools.
One tool to rule them all? No: use the right tools.
Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon Redshift, Amazon Elastic MapReduce
Amazon S3: store anything; object storage; scalable; 99.999999999% durability.
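Eleven nines is easier to reason about as an expected-loss figure. A minimal sketch of the arithmetic; the 10-million-object archive is a made-up example, and the durability number is Amazon's stated design target, not an observed loss rate:

```python
durability = 0.99999999999            # S3's stated 11-nines annual durability
p_loss = 1 - durability               # ~1e-11 chance of losing a given object in a year

objects = 10_000_000                  # hypothetical archive size
expected_losses = objects * p_loss    # expected objects lost per year, ~1e-4

print(f"expected losses/year: {expected_losses:.6f}")
print(f"roughly one loss every {1 / expected_losses:,.0f} years")
```

In other words, at that durability level a ten-million-object archive would expect to lose one object about once every ten thousand years.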
Amazon Kinesis: real-time processing; high throughput; elastic; easy to use; integrations with EMR, S3, Redshift, and DynamoDB.
Amazon DynamoDB: NoSQL database; seamless scalability; zero admin; single-digit millisecond latency.
Amazon Redshift: relational data warehouse; massively parallel; petabyte scale; fully managed; $1,000/TB/year.
Amazon Elastic MapReduce: Hadoop/HDFS clusters; Hive, Pig, Impala, HBase; easy to use; fully managed; on-demand and Spot pricing; tight integration with S3, DynamoDB, and Kinesis.
[Architecture diagram: data sources flowing into Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR (HDFS, analytics languages), and Amazon Redshift (data management), orchestrated by AWS Data Pipeline]
Bizo: digital ad-tech metering with Amazon Kinesis (continuous ad metrics extraction; incremental ad statistics computation; metering record archive; ad analytics dashboard).
Customer use cases across AWS: free steak campaign; Facebook page; Mars exploration ops; consumer social apps; ticket pricing optimization; SAP & SharePoint; securities trading data archiving; marketing web site; interactive TV apps; financial markets analytics; big data analytics; web site & media sharing; disaster recovery; media streaming; web and mobile apps; streaming webcasts; Facebook apps; business line of sight; mobile analytics; IT operations; digital media; core IT and media; ground campaign.
Generation → Collection & storage → Analytics & computation → Collaboration & sharing
Collection & storage: Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon Kinesis, Amazon EMR, AWS Direct Connect, AWS Storage Gateway, AWS Import/Export.
Analytics & computation: Amazon EC2, Amazon EMR, Amazon Kinesis.
Collaboration & sharing: Amazon CloudFront, AWS CloudFormation, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon EC2, Amazon EMR, AWS Data Pipeline.
The right tools. At the right scale. At the right time.
What we’ll cover today: Big Data, HPC, Customer Success Story, Getting Started on AWS
Take a typical big computation task…
…that an average cluster is too small
(or simply takes too long to complete)…
…optimization of algorithms can give some leverage…
…and complete the task in hand…
Applying a large cluster…
…can sometimes be overkill and too expensive
AWS instance clusters can be balanced to the job in hand, neither too large nor too small, with multiple clusters running at the same time.
Why AWS for HPC?
Low cost with flexible pricing; efficient clusters; unlimited infrastructure; faster time to results; concurrent clusters on-demand; increased collaboration.
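The "balance the cluster to the job in hand" idea above reduces to simple arithmetic. A sketch under the optimistic assumption of near-linear scaling; the job size, deadline, and instance type mapping are hypothetical examples:

```python
import math

def instances_needed(total_core_hours: float, deadline_hours: float,
                     cores_per_instance: int = 32) -> int:
    """Smallest instance count that finishes the job by the deadline,
    assuming near-linear scaling (a simplification: real HPC jobs
    rarely scale perfectly)."""
    cores_required = total_core_hours / deadline_hours
    return math.ceil(cores_required / cores_per_instance)

# A hypothetical 10,240 core-hour job, on 32-vCPU instances (e.g. c3.8xlarge):
print(instances_needed(10_240, deadline_hours=4))    # 80 instances to finish in 4 hours
print(instances_needed(10_240, deadline_hours=24))   # 14 instances if overnight is fine
```

Because the total core-hours (and hence cost) stay roughly constant, the deadline becomes a dial you turn rather than a constraint imposed by a fixed cluster.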
AWS High Performance Computing: cluster compute instances
HVM process execution; Intel® Xeon® processors; 10 Gigabit Ethernet (C3 has Enhanced Networking, SR-IOV)
cc2.8xlarge: 32 vCPUs; 2.6 GHz Intel Xeon E5-2670 (Sandy Bridge); 60.5 GB RAM; 4 x 840 GB local HDD
c3.8xlarge: 32 vCPUs; 2.8 GHz Intel Xeon E5-2680 v2 (Ivy Bridge); 60 GB RAM; 2 x 320 GB local SSD
Top 500 supercomputer using Amazon EC2
A cluster of c3.8xlarge instances (32 vCPUs; 2.8 GHz Intel Xeon E5-2680 v2, Ivy Bridge; 60 GB RAM; 2 x 320 GB local SSD each)
64th fastest supercomputer, Nov 2013; 26,496 Intel® Xeon® cores
Linpack performance (Rmax): 484.2 TFlop/s; theoretical peak (Rpeak): 593.5 TFlop/s
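Two numbers worth deriving from that Top 500 entry; the inputs are the slide's figures, only the arithmetic is added here:

```python
rmax_tflops = 484.2     # measured Linpack performance (Rmax), from the slide
rpeak_tflops = 593.5    # theoretical peak (Rpeak)
cores = 26_496

efficiency = rmax_tflops / rpeak_tflops             # Rmax as a share of Rpeak
per_core_gflops = rmax_tflops * 1_000 / cores       # TFlop/s -> GFlop/s per core

print(f"Linpack efficiency: {efficiency:.1%}")      # ~81.6%
print(f"per-core throughput: {per_core_gflops:.1f} GFlop/s")
```

An efficiency above 80% on Linpack is notable for a virtualized, Ethernet-connected cluster, which is the point the slide is making.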
AWS High Performance Computing: network placement groups
Cluster instances deployed in a placement group enjoy low-latency, full-bisection 10 Gbps bandwidth.
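Placement groups are requested at launch time. A sketch of the relevant request parameters as boto3's EC2 client accepts them; the group name, AMI, and instance count are placeholders, and no API call is made here since that would require AWS credentials:

```python
# Parameters for EC2's CreatePlacementGroup and RunInstances APIs, in the
# shape boto3's EC2 client takes them. Dicts only; no call to AWS is made.
create_placement_group = {
    "GroupName": "hpc-cluster",     # placeholder name
    "Strategy": "cluster",          # pack instances for low-latency, full-bisection bandwidth
}
run_instances = {
    "ImageId": "ami-xxxxxxxx",      # placeholder AMI
    "InstanceType": "c3.8xlarge",   # supports enhanced networking (SR-IOV)
    "MinCount": 16,                 # all-or-nothing: launch the whole cluster or nothing
    "MaxCount": 16,
    "Placement": {"GroupName": create_placement_group["GroupName"]},
}
print(run_instances["Placement"])
```

Setting `MinCount` equal to `MaxCount` is the usual HPC pattern: a partial cluster is rarely useful for a tightly coupled job.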
AWS High Performance Computing: GPU compute instances
cg1.4xlarge: Intel® Xeon® X5570; 33.5 ECUs; 22.5 GB RAM; 2x NVIDIA GPU (448 cores, 3 GB memory each)
g2.2xlarge: Intel® Xeon® E5-2670; 8 vCPUs; 15 GB RAM; 1x NVIDIA GPU (1,536 cores, 4 GB memory)
G2 instances: 1 NVIDIA Kepler GK104 GPU; I/O performance: very high (10 Gigabit Ethernet)
CG1 instances: 2x NVIDIA Tesla “Fermi” M2050 GPUs; I/O performance: very high (10 Gigabit Ethernet)
HPC Partners and Apps: making production cloud HPC easy, from 64 cores to …
Pharma: Johnson & Johnson. Manufacturing: HGST, a Western Digital company. Financial services: Pacific Life Insurance. Genomics: Life Technologies. Research: The Aerospace Corporation.
… 156,314 cores for better solar panel materials for $33k, not $68M
Amazon EC2: 16,788 Spot Instances across all 8 regions; Amazon S3: 4 TB processed; 1.21 PetaFLOPS (Intel Sandy Bridge on CC2)
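A few back-of-the-envelope numbers follow from those figures; the inputs are the slide's, only the division is added:

```python
pflops = 1.21            # aggregate throughput of the run
cores = 156_314
instances = 16_788       # Spot Instances across all 8 regions
cost_usd = 33_000        # ~$33k for the whole run

print(f"per-core throughput: {pflops * 1e6 / cores:.2f} GFlop/s")  # ~7.74
print(f"cores per instance:  {cores / instances:.1f}")             # ~9.3 (mixed instance types)
print(f"cost per core:       ${cost_usd / cores:.3f}")             # ~$0.21
```

About 21 cents per core for the run is the punchline behind "for $33k, not $68M."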
What we’ll cover today: Big Data, HPC, Customer Success Story, Getting Started on AWS
AWS Customer Success Story
David Hinz, Director Cloud and HPC Solutions
HGST, Inc., 3/25/14
Founded in 2003 through the combination of the hard drive
businesses of IBM, the inventor of the hard drive, and
Hitachi, Ltd (“Hitachi”)
Acquired by Western Digital in 2012
More than 4,200 active worldwide patents
Headquartered in San Jose, California
Approximately 41,000 employees worldwide
Develops innovative, advanced hard disk drives, enterprise-class
solid state drives, external storage solutions and services
Delivers intelligent storage devices that tightly integrate hardware
and software to maximize solution performance
Product portfolio: Capacity Enterprise (7200 RPM & CoolSpin HDDs; Ultrastar® & MegaScale DC™); Performance Enterprise (10K & 15K HDDs; Ultrastar®); Cloud & Datacenter; Enterprise SSD (PCIe, SAS; +3 acquisitions in 2013).
April 2013: zero to cloud in less than 12 months. By 31 Dec 2013:
Cloud email: Microsoft Office 365
Cloud email archiving/eDiscovery
External single sign-on (off VPN)
Cloud file/collaboration: Box
Salesforce.com, integrated to save files in Box
Cloud high performance computing (HPC) on AWS
Cloud big data platform on AWS
Cloud data mart and provisioning service, with Amazon Redshift
Evolution of data centers and HPC @ HGST
The old system: siloed HPC clusters per team (Servo team in SJC, Servo team in Japan, Head team in the US, HAMR team in the US, PCB team in MN) spread across HGST datacenters, on premise and off premise.
Now: an agile enterprise datacenter integrating on-premise and cloud solutions. On site, HGST evaluates new storage technologies and solutions “in house” (HDD, SSD, etc.) and runs business, production, and enterprise computing; siloed on-premise clusters and internal wikis are complemented by the cloud, where each team (Servo in SJC, PCB in MN, HAMR in the US, Head in the US, Servo in Japan) runs its HPC workloads.
HPC: Molecular Dynamics Simulation
• HGST uses molecular dynamics simulation for R&D of the materials and lubricants needed for HDDs
• Research to achieve higher memory densities, faster read/write capabilities, smaller form factors, and lower power consumption

Model job sizes used at HGST (complexity in atoms; number of time steps; “frequency”):
Small: 300,000 atoms; 100 steps; 200 per day, 2 days per week, once or twice a month
Medium: 300,000 atoms; 1,000 steps; 20 medium jobs during the day, 4 days per month
Large: 300,000 atoms; 30,000 steps; 3 large jobs per day, 6 days per month
Very large: 300,000 atoms; 3,000,000 steps; 1 large job per month
Before: a shared 512-core supercomputer; jobs of 512, 256, 128, and 64 cores queue for sequential time slots, with waits of up to 2 weeks.
Today: AWS EC2 CC2 (max total 512 cores); all jobs run in parallel on AWS, a 1.67x throughput improvement.
Shape compute to match the work to be done.
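The scheduling difference behind that throughput gain can be sketched with a toy model: a fixed shared machine packs jobs into sequential time slots, while on-demand clusters let every job start at once. The job mix below is invented for illustration (HGST's real mix yields the 1.67x figure above, so the ratio here differs):

```python
# Hypothetical job mix: core counts, each representing 1 hour of work.
jobs = [512, 256, 256, 128, 128] + [64] * 16

shared_capacity = 512
# Shared machine: greedily pack jobs (largest first) into sequential
# 512-core time slots; each slot takes one hour.
slots, used = 1, 0
for cores in sorted(jobs, reverse=True):
    if used + cores > shared_capacity:
        slots, used = slots + 1, 0
    used += cores

elastic_makespan = 1   # on AWS, every job gets its own cluster and runs at once

print(f"shared makespan:  {slots} hours")
print(f"elastic makespan: {elastic_makespan} hour")
print(f"speedup: {slots / elastic_makespan:.1f}x")
```

The point is structural, not numeric: on a fixed machine the makespan grows with queue depth, while with per-job clusters it stays at the longest single job.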
HPC: Micro Magnetic Simulation
• Model new technologies for future HGST HDD products
• Finite-difference time-domain (FDTD) numerical analysis solver
– Accurate simulation of large, complex models with many variable parameters and materials
– Scales across large clusters
• AWS C3 instances provide significant improvements in both scalability and simulation throughput
AWS C3 instances provided 1.5x or better simulation performance.
“Cloud HPC”: What’s next…
• Deploy graphical user interfaces for HPC applications
– Pre- and post-processing in the cloud instead of migrating data back to local systems
• AWS C3 performance validated across many applications
– Improve overall performance and reduce the monthly AWS compute bill
• “Reduce data search parties”
– Stop playing “Where’s Waldo” with your data (“I know I have that data… somewhere?”)
– Data aggregation to a common platform with common access tools
• Improve yields by accessing more data in a more timely manner
– End-to-end visibility into every test, every diagnostic, and all info from all components of a product (internal and external)
– Speed up the yield-improvement ramp on new products
– Improve steady-state yield on existing products
“Big Data” in Manufacturing
HGST: BDP key metrics and highlights
• Metrics:
– Collecting >2M manufacturing/testing binary files daily
– Collecting tens of millions of records daily from ~500 tables across 6 databases
– Over 140 users to date in early piloting
– Over 150 attendees participated in BDP training
• Highlights: although early in the overall journey, HGST’s BDP is already demonstrating benefits:
Development engineer: demonstrated joining data sets for detailed logistics tracking, analyses that are very difficult to conduct with current systems.
Ops engineer: a recent production issue required detailed historical data that current systems did not retain; the team pulled the data from the BDP in minutes, as opposed to the 3+ weeks needed to pull it from tape archive.
Development engineer: obtained technical data from the BDP in hours, as opposed to 3+ weeks from tape archive.
Fewer data search parties; better yield.
HGST’s BDP journey:
1. Collect data
2. Update core API
3. Tailor data for consumers: with the base big data platform established, the focus shifts to enabling specific business use cases
4. Develop consumers
Core data processing: core data in Hadoop, exposed via Hive/API; enriched Hive tables and derived DBs; analytics libraries (dimension reduction, sampling); batch analytics in Hive, Python, R, …; custom websites, specific reports/visualizations, and specific analytics.
Early successes: the core effort to date has focused on building the platform, ingesting core data sets, providing base visualization/data-mining tools, and beginning to prepare the data for specific use cases.
From here: the next phase will build the specific websites/reports/visualizations tailored to each business use case.
Commercial HPC applications: cloud ready?
• HPC environments want cloud computing alongside in-house machines
– “Hybrid” data centers: on-premise workstations and clusters (some legacy, some new) with burst/overflow connection to the cloud
– EULAs should comprehend the cloud: allow license-server placement in the cloud, accessible on premise, and make it easy to add cloud computing to current licenses
• Consumption-based pricing
– No consistency across vendors
– Not aligned with time-based consumption pricing
“We’ve Only Just Begun…”
• Current results delivered in less than 12 months
• Re-aligning business group leadership, development teams, and research and development teams on the new capabilities model
• Demands and uses expected to grow and accelerate market success
2013 “heavy lifting” provides the foundation for 2014 acceleration.
What we’ll cover today: Big Data, HPC, Customer Success Story, Getting Started on AWS
AWS is here to help: Solution Architects, Professional Services, Premium Support, and the AWS Partner Network (APN).
AWS Architecture Diagrams
https://aws.amazon.com/architecture/
Process large amounts of data in parallel using a scalable cluster, with commonly available cluster-scheduling tools such as Grid Engine or Condor.
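With schedulers like Grid Engine, jobs are described in small submit scripts. A sketch that generates a Grid Engine array-job script; the directive names (`-N`, `-cwd`, `-pe`, `-t`) and the `SGE_TASK_ID` variable are standard SGE, while the job name, the `mpi` parallel-environment name, the slot count, and the `process_chunk` command are hypothetical:

```python
# Generate a minimal Sun/Open Grid Engine submit script for an array job
# of 100 independent tasks, each requesting 64 slots in a parallel
# environment. Values are illustrative placeholders.
submit_script = """#!/bin/bash
#$ -N chunk-job
#$ -cwd
#$ -pe mpi 64
#$ -t 1-100
./process_chunk --task "$SGE_TASK_ID"
"""

with open("submit.sh", "w") as fh:
    fh.write(submit_script)

print("wrote submit.sh; submit with: qsub submit.sh")
```

The same script works unchanged whether the Grid Engine cluster is on premise or built from EC2 instances, which is what makes these tools a natural on-ramp to cloud HPC.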
AWS Marketplace: the AWS online software store
https://aws.amazon.com/marketplace

Big Data Case Studies: learn from other AWS customers
https://aws.amazon.com/solutions/case-studies/big-data
AWS Public Data Sets
Free access to big data sets
https://aws.amazon.com/publicdatasets
AWS Grants Program: AWS in Education
https://aws.amazon.com/grants
AWS Big Data Test Drives
APN Partner-provided labs
https://aws.amazon.com/testdrive/bigdata
AWS Training & Events: webinars, bootcamps, and self-paced labs
https://aws.amazon.com/training
https://aws.amazon.com/events
Big Data on AWS: a brand-new course on big data
https://aws.amazon.com/training/course-descriptions/bigdata/
https://aws.amazon.com/big-data
https://aws.amazon.com/hpc
@bensbutler (both Twitter and LinkedIn)
Thank you!