(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014


DESCRIPTION

Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.


November 13th, 2014 | Las Vegas, NV

Ian Meyers, Amazon Web Services

[Diagram: AWS service categories: Compute, Storage, Database, Networking, Analytics, App Services, Deployment & Administration, AWS Global Infrastructure]

Amazon Elastic MapReduce
• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift
• Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically
• Native support for Spot Instances
• Integrated HBase NoSQL database

Amazon EMR

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop

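--keyword-config-file: merge the values in a new config file into the existing configuration

--keyword-key-value: override values with the key-value pairs provided

(keyword is the Configuration File Keyword for the file being changed)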

Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y

Set number of mappers per task tracker
• Useful for small memory footprint map tasks
• More work done with a given instance

Set HDFS block size to 1MB
• Useful for smaller files when HDFS is used

Reuse mappers
• Mapper startup time ~ 2-20 seconds
• Useful for tasks with large number of mappers
• Mappers must be “clean” after run (relevant for Java)
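As a rough illustration of how the key-value shortcuts in the table above might be combined with these three tuning tips, the sketch below builds a comma-separated argument string for the configure-hadoop bootstrap action. The helper name, the Hadoop 1 property names and the value of 2 mappers are assumptions for illustration and do not come from the slides; only the shortcut letters and the 1MB block size do.

```python
# Key-value shortcut letters copied from the configuration table in the slides.
KEY_VALUE_SHORTCUT = {"core": "c", "hdfs": "h", "mapred": "m", "yarn": "y"}


def configure_hadoop_args(settings):
    """Build a comma-separated argument string from (keyword, property, value) tuples.

    Hypothetical helper: each setting becomes "-<shortcut>,<property>=<value>".
    """
    parts = []
    for keyword, prop, value in settings:
        parts.append("-" + KEY_VALUE_SHORTCUT[keyword])
        parts.append(f"{prop}={value}")
    return ",".join(parts)


# Property names are assumed Hadoop 1 names, not taken from the slides.
tuning = [
    ("mapred", "mapred.tasktracker.map.tasks.maximum", 2),  # mappers per task tracker (example value)
    ("hdfs", "dfs.block.size", 1024 * 1024),                # 1MB HDFS block size, in bytes
    ("mapred", "mapred.job.reuse.jvm.num.tasks", -1),       # reuse mapper JVMs without limit
]

args = configure_hadoop_args(tuning)
print(args)
```

The resulting string would be passed as the bootstrap action's arguments; check the property names against the Hadoop version on the cluster before relying on them.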

Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh

Hadoop 1

Hadoop 2

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --{namenode}-heap-size=2048,--{namenode}-opts=-XX:GCTimeRatio=19

[Diagram: EMRFS and HDFS on Amazon EMR, with Amazon S3 and Amazon DynamoDB. Labels: Processed Files Registry, File Data]

55

5

≈60sec * 15MB 1GB

aws emr add-steps --cluster-id <cluster>

--steps Name=GroupSmallFiles,

Type=CUSTOM_JAR,

Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar,

src,s3://myawsbucket/cf,

dest,hdfs:///local,

groupBy,.*(i-\w.log).*,

targetSize,128…
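The grouping idea behind the groupBy argument can be sketched in a few lines: files whose names produce the same regular-expression capture group are collected together so they can be combined into one larger file. This is only a local simulation, not S3DistCp itself; the pattern below is a hypothetical variant of the one on the slide, and the object keys are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern: the first capture group names the combined output file.
GROUP_BY = re.compile(r".*(i-\w+\.log).*")


def group_files(keys, pattern=GROUP_BY):
    """Group object keys by the text captured in the pattern's first group."""
    groups = defaultdict(list)
    for key in keys:
        match = pattern.match(key)
        if match:  # in this sketch, keys that do not match are skipped
            groups[match.group(1)].append(key)
    return dict(groups)


# Made-up object keys, for illustration only.
keys = [
    "cf/2014-11-13-00.i-abc123.log.gz",
    "cf/2014-11-13-01.i-abc123.log.gz",
    "cf/2014-11-13-00.i-def456.log.gz",
    "cf/readme.txt",
]

grouped = group_files(keys)
for name, members in grouped.items():
    print(name, len(members))
```

Testing a pattern this way against a handful of real key names is a cheap check before running a copy step against a large bucket.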

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21MB/s           118MB/s
LZO         20%                 135MB/s          410MB/s
Snappy      22%                 172MB/s          409MB/s
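To make the trade-off in the table concrete, the following sketch estimates compressed size and single-threaded encode and decode time for a given input size. It assumes the speeds are measured against the uncompressed data volume, which the slides do not state, and the function name is illustrative.

```python
# Figures copied from the compression table in the slides.
CODECS = {
    "GZIP":   {"space_remaining": 0.13, "encode_mb_s": 21,  "decode_mb_s": 118},
    "LZO":    {"space_remaining": 0.20, "encode_mb_s": 135, "decode_mb_s": 410},
    "Snappy": {"space_remaining": 0.22, "encode_mb_s": 172, "decode_mb_s": 409},
}


def estimate(codec, input_mb):
    """Estimate compressed size (MB) and encode/decode time (seconds).

    Assumption: the table's speeds apply to the uncompressed data volume.
    """
    c = CODECS[codec]
    return {
        "compressed_mb": input_mb * c["space_remaining"],
        "encode_s": input_mb / c["encode_mb_s"],
        "decode_s": input_mb / c["decode_mb_s"],
    }


for name in CODECS:
    result = estimate(name, 1024)  # 1GB of input
    print(name, {k: round(v, 1) for k, v in result.items()})
```

Under these assumptions GZIP gives the smallest output but is several times slower to write than LZO or Snappy, which is the usual reason to prefer the faster codecs for intermediate data.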

--outputCodec,lzo

[Diagram: Amazon EMR cluster (Task Instance Group, Core Instance Group with HDFS) and Amazon S3. Callout: HUGE Benefit!!]

[Diagram: two EMR clusters and Amazon S3]

[Diagram: Amazon EMR cluster (Task Instance Group, Core Instance Group with HDFS) and Amazon S3, with S3DistCp between Amazon S3 and HDFS]

[Diagram: EMR, HDFS, Pig]

Hive 0.13.1
• Support for ORC
• Window functions
• Decimal types
• TRUNCATE command
• Better optimiser (less need for hinting)

Pig 0.12.0
• Streaming UDFs not written in Java
• Native support for Avro
• Native support for Parquet
• Improved data types

Impala 1.1
• In-memory SQL engine
• Support for HBase tables
• Support for Parquet (column-oriented file format)
• Query and interactive shells

HBase 0.94.18
• Database Snapshotting
• Improved read caching and seek optimisation
• Improved transactions

Amazon EMR Integration with Amazon Kinesis
• Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
• No Intermediate Data Persistence Required
• Simple way to introduce real time sources into Batch Oriented Systems
• Multi-Application Support & Automatic Checkpointing

drop table call_data_records;

CREATE TABLE call_data_records (
  start_time bigint,
  end_time bigint,
  phone_number STRING,
  carrier STRING,
  recorded_duration bigint,
  calculated_duration bigint,
  lat double,
  long double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");


EC2 Instance               Map Tasks   Reduce Tasks
m1.small                   2           1
m1.large                   3           1
m1.xlarge                  8           3
m2.xlarge                  3           1
m2.2xlarge                 6           2
m2.4xlarge                 14          4
m3.xlarge                  6           1
m3.2xlarge                 12          3
cg1.4xlarge                12          3
cc2.8xlarge                24          6
c3.4xlarge                 24          6
hi1.4xlarge                24          6
hs1.8xlarge                24          6
cr1.8xlarge & c3.8xlarge   48          12
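One way to use the table above is as a lookup for how many map and reduce tasks a group of identical instances can run at once. The sketch below is a minimal illustration under that reading; the function name is hypothetical and the numbers are copied from the table.

```python
# (map tasks, reduce tasks) per instance, copied from the table in the slides.
TASK_CAPACITY = {
    "m1.small": (2, 1),
    "m1.large": (3, 1),
    "m1.xlarge": (8, 3),
    "m2.xlarge": (3, 1),
    "m2.2xlarge": (6, 2),
    "m2.4xlarge": (14, 4),
    "m3.xlarge": (6, 1),
    "m3.2xlarge": (12, 3),
    "cg1.4xlarge": (12, 3),
    "cc2.8xlarge": (24, 6),
    "c3.4xlarge": (24, 6),
    "hi1.4xlarge": (24, 6),
    "hs1.8xlarge": (24, 6),
    "cr1.8xlarge": (48, 12),
    "c3.8xlarge": (48, 12),
}


def cluster_capacity(instance_type, instance_count):
    """Return (map tasks, reduce tasks) that can run in parallel on the group."""
    map_tasks, reduce_tasks = TASK_CAPACITY[instance_type]
    return map_tasks * instance_count, reduce_tasks * instance_count


print(cluster_capacity("m1.xlarge", 11))
print(cluster_capacity("c3.8xlarge", 4))
```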

[Chart: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), Local Storage (GB)]

Instance     Cost / Map Task   Cost / Reduce Task
m1.large     $0.08             $0.15
m1.xlarge    $0.06             $0.15
m3.xlarge    $0.04             $0.07
m3.2xlarge   $0.04             $0.07

Instance     Cost / Map Task   Cost / Reduce Task
c1.medium    $0.13             $0.13
c1.xlarge    $0.35             $0.70
c3.xlarge    $0.05             $0.11
c3.2xlarge   $0.05             $0.11

Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)

1. Estimate the number of tasks your job requires.
   Example: 150

2. Pick an instance and note down the number of tasks it can run in parallel.
   Example: m1.xlarge with 8 task capacity per instance

3. Pick some sample data files to run a test workload. The number of sample files should be the same as the number from step #2.
   Example: 8 files selected for our sample test

4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process this dataset.
   Example: 3 min to process 8 files

Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time)

= (150 * 3 min) / (8 * 5 min) = 11.25, or about 11 m1.xlarge
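The estimate above is plain arithmetic, so it can be checked with a short sketch. The function and parameter names are illustrative; the inputs are the example values from the slides (150 tasks, 3 minutes for the sample, 8 tasks per m1.xlarge, 5 minutes desired processing time).

```python
import math


def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Estimated nodes = (total tasks * sample time) / (task capacity * desired time)."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


raw = estimate_nodes(total_tasks=150, sample_minutes=3,
                     tasks_per_instance=8, desired_minutes=5)

print(raw)             # unrounded estimate
print(round(raw))      # rounded to the nearest node, as on the slide
print(math.ceil(raw))  # rounded up, if the desired time is a hard limit
```

The unrounded result is 11.25. The slide rounds it to 11 nodes; rounding up to 12 is the more conservative choice when the job must finish within the desired time.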

Amazon EMR cluster: master, core and task instance groups

Core instance group
• Run TaskTrackers (Compute)
• Run DataNode (HDFS)
• Can add core nodes: more HDFS space, more CPU/memory
• Can’t remove core nodes because of HDFS

Task instance group
• Run TaskTrackers
• No HDFS
• Reads from core node HDFS
• Can add task nodes: more CPU power, more memory
• You can remove task nodes when processing is completed
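Adding and removing task nodes comes down to changing the instance count of the task instance group. As a hedged sketch, the helper below only builds the argument list for the AWS CLI's aws emr modify-instance-groups command and does not run it; the helper name and the instance group ID ig-EXAMPLE are hypothetical, and shrinking to zero assumes the task group may be resized that far.

```python
def resize_task_group_command(instance_group_id, instance_count):
    """Build (but do not run) an AWS CLI command that resizes an instance group."""
    if instance_count < 0:
        raise ValueError("instance_count must be zero or greater")
    return [
        "aws", "emr", "modify-instance-groups",
        "--instance-groups",
        f"InstanceGroupId={instance_group_id},InstanceCount={instance_count}",
    ]


# "ig-EXAMPLE" is a placeholder, not a real instance group ID.
scale_up = resize_task_group_command("ig-EXAMPLE", 8)    # add task nodes for processing
scale_down = resize_task_group_command("ig-EXAMPLE", 0)  # remove them when processing is completed

print(" ".join(scale_up))
print(" ".join(scale_down))
```

The list form can be handed to subprocess.run once the instance group ID has been replaced with a real one; only the task group should be shrunk this way, since core nodes hold HDFS data.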


[Diagram: Amazon EMR cluster (Master, Core and Task instance groups) with Amazon CloudWatch]

http://bit.ly/awsevals
