Hadoop and HBase on Amazon Web Services

Introducing big data and analytics with Hadoop, HBase and Amazon Elastic MapReduce.

Thank you.

Introducing Hadoop

HBase on AWS

Cost optimization

Data for competitive advantage.

Customer segmentation, financial modeling, system analysis, line-of-sight, business intelligence...

Using data

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Cost of data generation is falling.

lower cost, increased throughput

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

HIGHLY CONSTRAINED

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Very high barrier to turning data into information.

Move from a data generation challenge to an analytics challenge.

Enter the AWS Cloud.

Remove the constraints.

Enable data-driven innovation.

Move to a distributed data approach.

Maturation of two things.

Software for distributed storage and analysis

Infrastructure for distributed storage and analysis

Frameworks for data-intensive workloads.

Software

Distributed by design.

Platform for data-intensive workloads.

Infrastructure

Distributed by design.

Support the data life cycle.

HIGHLY CONSTRAINED

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower the barrier to entry.

Accelerate time to market and increase agility.

Enable new business opportunities.

Washington Post

Pinterest

NASA

“AWS enables Pfizer to explore difficult or deep scientific questions in a timely, scalable manner and helps us make better decisions more quickly”

Michael Miller, Pfizer

Introducing Hadoop

Maturation of two things.

Software for distributed storage and analysis

Infrastructure for distributed storage and analysis

Apache Hadoop

Software for distributed storage and analysis

Implements the map/reduce pattern

Focus on your data
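The map/reduce pattern named on this slide can be sketched in plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce), not Hadoop's implementation; the function names are made up for the example, and word counting is only a stand-in workload.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the author only writes the two small functions, which is the sense of "focus on your data".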

Built for uncertainty

Hadoop provides tools to navigate data

Allows discovery

Query flexibility at scale

Built for flexibility

Java native

Executes code in any language

Just a distribution mechanism
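Running code in any language works through Hadoop's streaming interface: the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the framework sorts by key in between. The sketch below imitates that contract with Python generators. The input layout (`user<TAB>url`) and the function names are assumptions chosen for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'url<TAB>1' for each well-formed 'user<TAB>url' input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:  # skip malformed lines
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for url, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{url}\t{total}"


def simulate(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster the same two functions would be wrapped in scripts that read standard input and print to standard output; the sort between them is done by Hadoop rather than by `sorted`.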

Rich ecosystem

Diverse tools

Machine learning, recommendations, predictive analytics, segmentation, real time analysis

Lots of innovation

But...

A very big project

500k+ lines of code

Challenging to configure and optimize

Undifferentiated heavy lifting

Amazon Elastic MapReduce

Web service for data processing

Hosted Hadoop

Configured and optimized

Amazon Elastic MapReduce

Job flows

Elastic platform

Maintain clusters or run once and terminate

Debugging tools

Input data (S3)

Code

Elastic MapReduce

Name node

Elastic cluster

HDFS

Queries + BI (via JDBC, Pig, Hive)

Output (S3 + SimpleDB)

Hadoop all the way down

Amazon Hadoop distribution

HDFS

Streaming interface

Hive, Pig, Mahout, Spark, Shark

Data integration

Optimized and integrated into AWS environment

Reads and writes to S3

Analytics on DynamoDB data

Can process data from any source: Cassandra, Mongo, Couch, Amazon RDS

Data movement

Multi-part upload

Import/Export

AWS Direct Connect

Aspera

Cluster scalability

Resize running job flows

Add capacity for shorter runs

Remove capacity during off-peak hours

Balance scale and cost
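The trade-off between scale and cost can be worked through with simple arithmetic. The sketch below assumes ideal linear scaling (twice the nodes, half the run time) and a flat hourly price per instance; both are simplifications, and the workload size and price are illustrative values, not measurements.

```python
def run_plan(total_instance_hours, nodes, price_per_instance_hour):
    """Return (wall-clock hours, total cost) under ideal linear scaling."""
    hours = total_instance_hours / nodes
    cost = nodes * hours * price_per_instance_hour
    return hours, cost


# Hypothetical job needing 140 instance-hours at $0.08 per instance-hour.
small_hours, small_cost = run_plan(140, 10, 0.08)  # 10 nodes
large_hours, large_cost = run_plan(140, 20, 0.08)  # 20 nodes
```

Under these assumptions the larger cluster finishes in half the time for the same spend. Real jobs rarely scale perfectly and billing granularity matters, so the balance has to be measured per workload.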

Cluster scalability

14 hours remaining

7 hours remaining

3 hours remaining

Steady state, large batch task, steady state

Cluster availability

Canonical source of data

Anyone on the engineering team

IAM integration

Monitoring

Click stream analysis for retail

3.5 billion records

71 million unique cookies

1.7 million targeted ads

13 TB of clickstream logs

Each day

Click stream analysis for retail

Workflow time from 2 days to 8 hours

Procurement time from 2 months to 5 minutes

$13k per month

500% increase in return on advertising spend

Months of user click-through data

Search terms

Ads displayed

Premium listing inventory

Log data stored in Amazon S3.

Elastic MapReduce spins up a 200-instance Hadoop cluster.

Find patterns across logs. Write results to S3.

Hadoop in the AWS Cloud

Elastic MapReduce for hosted Hadoop

Optimized, configured, ready to roll

Focus on the business benefit of data

Hadoop all the way down

HBase on AWS

Vibrant ecosystem

Mahout for machine learning

Mesos for cluster management

Spark for fast analytics

HBase for unstructured data

HBase

NoSQL data store

Runs on top of HDFS

Scalable

Rapid retrieval across large datasets

Architecture

Huge, distributed map/hash

Distributed

Implements Bloom filters

Sortable
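A Bloom filter answers "is this key definitely absent?" without reading the data itself, which lets a store skip files that cannot contain a requested row. The class below is a minimal sketch of the general technique, not HBase's implementation; the sizes, the hashing scheme and the method names are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        """Derive num_hashes bit positions by salting SHA-256 with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[position] for position in self._positions(key))
```

A lookup that returns False can skip the expensive read entirely; a True still requires checking the real data, because unrelated keys can set the same bits.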

Column based

Columns are similar to fields

Rows are records
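The row and column model can be pictured as a sorted map from row key to a small map of columns. The toy class below is a single-machine sketch under that assumption; `TinyTable`, its methods and the `family:qualifier` column names are hypothetical and do not mirror the HBase client API.

```python
import bisect


class TinyTable:
    """Toy table: sorted row keys, each row a dict of column -> value."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of one record, or an empty dict if absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_left(self._keys, stop)
        return [(key, dict(self._rows[key])) for key in self._keys[low:high]]
```

Because keys stay sorted, a range scan only touches the neighbouring rows it needs, and rows are free to carry different sets of columns.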

Built for data

Built to scale across billions of rows

The more data, the better the relative performance

But...

Large, complex project

Running in production can be challenging

Distributed system

Undifferentiated heavy lifting

HBase for Elastic MapReduce

Using HBase

Social media firehose

Customer information

Usage and application logs

Hadoop analytics

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Amazon DynamoDB

NoSQL database service

Provisioned throughput

Unlimited storage

Very easy to use

DynamoDB & Amazon EMR

SQL like queries

Query flexibility at scale

Integrate queries across datasets

Hive

NoSQL on the AWS Marketplace

CouchDB

Cassandra

MongoDB

aws.amazon.com/marketplace

Cost optimization

Lowered prices 19 times in the past six years.

On-demand

Reserved capacity

Spot market

$0.08 vs $0.007 (yesterday evening)
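To make the price gap concrete, the sketch below applies the two figures from this slide to a cluster sized like the retail example earlier in the deck (200 instances, 8 hours). It assumes both prices are per instance-hour and that the spot price holds for the whole run; neither is guaranteed, since spot prices move and spot capacity can be reclaimed.

```python
ON_DEMAND_PRICE = 0.08  # assumed USD per instance-hour, from the slide
SPOT_PRICE = 0.007      # assumed USD per instance-hour, from the slide


def cluster_cost(instances, hours, price_per_instance_hour):
    """Total cost of running a fixed-size cluster for a fixed duration."""
    return instances * hours * price_per_instance_hour


on_demand_total = cluster_cost(200, 8, ON_DEMAND_PRICE)
spot_total = cluster_cost(200, 8, SPOT_PRICE)
price_ratio = ON_DEMAND_PRICE / SPOT_PRICE  # how many times cheaper spot is
```

With these inputs the on-demand run costs about $128 and the spot run about $11, a ratio of roughly 11 to 1.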

Reserved Instance Marketplace

Introducing Hadoop

HBase on AWS

Cost optimization

aws.amazon.com/elasticmapreduce
