(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

November 13th, 2014 | Las Vegas, NV

Ian Meyers, Amazon Web Services

[Diagram: AWS service categories (Compute, Storage, Database, App Services, Deployment & Administration, Networking, Analytics) on top of the AWS Global Infrastructure]

Amazon Elastic MapReduce

• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift
• Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically
• Native support for Spot Instances
• Integrated HBase NoSQL database

Amazon EMR

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

--keyword-config-file – merge values in new config to existing

--keyword-key-value – override values provided

Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut
core-site.xml | core | C | c
hdfs-site.xml | hdfs | H | h
mapred-site.xml | mapred | M | m
yarn-site.xml | yarn | Y | y

Set number of mappers per task tracker

Useful for small memory footprint map tasks

More work done with a given instance

Set HDFS block size to 1MB

Useful for smaller files when HDFS is used

Reuse mappers

Mapper startup time ~ 2-20 seconds

Useful for tasks with large number of mappers

Mappers must be “clean” after run (relevant for Java)

Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh

Hadoop 1

Hadoop 2

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --{namenode}-heap-size=2048,--{namenode}-opts=-XX:GCTimeRatio=19

[Diagram: Amazon EMR with file data in Amazon S3 and a processed-files registry in Amazon DynamoDB. Labels: ≈60sec * 15MB, 1GB]

aws emr add-steps --cluster-id <cluster> \
  --steps Name=GroupSmallFiles,Type=CUSTOM_JAR,Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar,src,s3://myawsbucket/cf,dest,hdfs:///local,groupBy,.*(i-\w.log).*,targetSize,128…
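The grouping idea behind groupBy can be sketched in a few lines of Python: files whose names yield the same value for the parenthesised capture group are treated as one group to be concatenated. This is only an illustration of the grouping step, not S3DistCp itself. The pattern below is a hypothetical variant of the one on the slide, and the file names are made up for the example. The target size limit is not modelled.

```python
import re
from collections import defaultdict

# Hypothetical pattern (not the slide's exact regex): capture the instance id
# that precedes ".log" in each key.
GROUP_BY = re.compile(r".*(i-\w+)\.log.*")

def group_files(keys, pattern=GROUP_BY):
    """Group keys by the text matched by the pattern's first capture group.

    Keys that do not match the pattern are left out of the result.
    """
    groups = defaultdict(list)
    for key in keys:
        match = pattern.match(key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)

# Made-up file names, for illustration only.
keys = [
    "cf/i-0a1b2c.log.2014-11-13-00",
    "cf/i-0a1b2c.log.2014-11-13-01",
    "cf/i-9f8e7d.log.2014-11-13-00",
    "cf/readme.txt",
]
groups = group_files(keys)
for group_name, members in groups.items():
    print(group_name, len(members))
```

In this sketch the two files from the same instance land in one group, the third file forms its own group, and the non-matching file is skipped.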

Algorithm | % Space Remaining | Encoding | Decoding
GZIP | 13% | 21MB/s | 118MB/s
LZO | 20% | 135MB/s | 410MB/s
Snappy | 22% | 172MB/s | 409MB/s

-outputCodec,lzo
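To make the trade-off between the codecs concrete, the following rough sketch turns the figures above into estimated sizes and times for a given input. It assumes the MB/s rates apply to the uncompressed data volume, which the slide does not state, and the function and variable names are illustrative only.

```python
# Figures copied from the slide:
# (fraction of space remaining, encoding MB/s, decoding MB/s)
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}

def estimate(input_mb, codec):
    """Estimate compressed size and encode/decode time for input_mb of data.

    Assumption: throughput is measured against the uncompressed size.
    """
    remaining, encode_rate, decode_rate = CODECS[codec]
    return {
        "compressed_mb": input_mb * remaining,
        "encode_s": input_mb / encode_rate,
        "decode_s": input_mb / decode_rate,
    }

for name in CODECS:
    e = estimate(1024, name)
    print(f"{name}: {e['compressed_mb']:.0f} MB, "
          f"encode {e['encode_s']:.1f}s, decode {e['decode_s']:.1f}s")
```

Under these assumptions GZIP gives the smallest output but is by far the slowest to encode, while LZO and Snappy trade some space for much higher throughput.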

[Diagram: Amazon EMR cluster with core and task instances working against Amazon S3. Label: HUGE Benefit!!]

Hive 0.13.1

• Support for ORC
• Window functions
• Decimal types
• TRUNCATE command
• Better optimiser (less need for hinting)

Pig 0.12.0

• Streaming UDFs not written in Java
• Native support for Avro
• Native support for Parquet
• Improved data types

Impala 1.1

• In-memory SQL engine
• Support for HBase tables
• Support for Parquet (column-oriented file format)
• Query and interactive shells

HBase 0.94.18

• Database Snapshotting
• Improved read caching and seek optimisation
• Improved transactions

• Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
• No Intermediate Data Persistence Required
• Simple way to introduce real time sources into Batch Oriented Systems
• Multi-Application Support & Automatic Checkpointing

Amazon EMR Integration with Amazon Kinesis

DROP TABLE call_data_records;

CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");


EC2 Instance | Map | Reduce
m1.small | 2 | 1
m1.large | 3 | 1
m1.xlarge | 8 | 3
m2.xlarge | 3 | 1
m2.2xlarge | 6 | 2
m2.4xlarge | 14 | 4
m3.xlarge | 6 | 1
m3.2xlarge | 12 | 3
cg1.4xlarge | 12 | 3
cc2.8xlarge | 24 | 6
c3.4xlarge | 24 | 6
hi1.4xlarge | 24 | 6
hs1.8xlarge | 24 | 6
cr1.8xlarge & c3.8xlarge | 48 | 12

Memory (GB) Mappers* Reducers* CPU (ECU Units) Local Storage (GB)

Instance | Cost / Map Task | Cost / Reduce Task
m1.large | $0.08 | $0.15
m1.xlarge | $0.06 | $0.15
m3.xlarge | $0.04 | $0.07
m3.2xlarge | $0.04 | $0.07

Instance | Cost / Map Task | Cost / Reduce Task
c1.medium | $0.13 | $0.13
c1.xlarge | $0.35 | $0.70
c3.xlarge | $0.05 | $0.11
c3.2xlarge | $0.05 | $0.11

Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)

1. Estimate the number of tasks your job requires.

2. Pick an instance type and note down the number of tasks it can run in parallel.

Example: m1.xlarge with 8 task capacity per instance

3. Pick some sample data files to run a test workload. The number of sample files should equal the task capacity from step 2.

Example: 8 files selected for our sample test

4. Run an Amazon EMR cluster with a single core node and process your sample files from step 3. Note down the amount of time taken to process this dataset.

Example: 3 min to process 8 files

Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)

= (150 * 3 min) / (8 * 5 min)

≈ 11 m1.xlarge
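The estimate above can be reproduced with a small Python sketch. The function and parameter names are illustrative, not part of any EMR tooling. The slide rounds the result to 11 nodes. Rounding up instead is shown as an alternative if the desired processing time is a hard limit.

```python
import math

def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Estimated nodes = (total tasks * sample time) / (task capacity * desired time)."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)

# Values from the worked example: 150 tasks, 3 min per sample batch,
# 8 parallel tasks per m1.xlarge, 5 min desired processing time.
raw = estimate_nodes(150, 3, 8, 5)

print("raw estimate:", raw)
print("rounded to nearest (as on the slide):", round(raw))
print("rounded up (conservative alternative):", math.ceil(raw))
```

The raw value is 11.25, so the rounded figure will run slightly over the desired time, while rounding up leaves some headroom at the cost of one extra instance.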

Master instance group

Core instance group

• Run TaskTrackers (Compute)
• Run DataNode (HDFS)

Can add core nodes

• More HDFS space
• More CPU/memory

Can’t remove core nodes because of HDFS

Task instance group

• Run TaskTrackers
• No HDFS
• Reads from core node HDFS

Can add task nodes

• More CPU power
• More memory

You can remove task nodes when processing is completed

Amazon CloudWatch

http://bit.ly/awsevals
