Introduction to Amazon Elastic MapReduce (English)


Slides from “Hadoop on the Cloud / The True Value of Amazon Elastic MapReduce” (Amazon Web Services, Jeff Barr). http://www.eventbrite.com/event/1278974447/efblike


Amazon Elastic MapReduce

MY BACKGROUND

• Based in Seattle, WA

• Education:

– BS in Computer Science, The American University, 1985

– Graduate student in Digital Media, University of Washington, 2010

• Background:

– Microsoft Visual Studio team

– Consulting to startups and VCs

– Amazon employee since 2002

• Evangelist:

– Speak

– Write

– Tweet

• Author, “Host Your Web Site in the Cloud”

• Email: jbarr@amazon.com

• Twitter: @jeffbarr

• What is Big Data

• Elastic MapReduce Overview

• Example Use Cases

• Ecosystem and Tools

• Upcoming Features

• Discussion

AGENDA

• Doesn’t refer just to volume

– You can benefit from Big Data infrastructure without having a ton of data

– Many existing technologies have little problem physically handling large volumes

• Challenges result from the combination of data volume, data structure, and usage demands from that data, usually tied to timeliness

• Big Data Tools are needed to provide a holistic view of enterprise data and systematically harness it for insights and trends

WHAT IS BIG DATA?

• Enables customers to easily, securely and cost-effectively process vast amounts of data:

– Spin up hundreds of instances

– Process hundreds of terabytes of data

• Hosted Hadoop framework running on Amazon’s web-scale infrastructure

WHAT IS AMAZON ELASTIC MAPREDUCE?

• Launch and monitor job flows

• AWS Management Console

• Command line interface

• REST API
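As a rough illustration of what launching a job flow through the API involves, the sketch below builds the kind of request such a launch carries. The key names follow the keyword arguments of boto3's EMR `run_job_flow` call; boto3 is an assumption here (the deck does not name a client library), and the helper `build_job_flow_request`, the step name, and every S3 location are hypothetical. Nothing is sent to AWS.

```python
def build_job_flow_request(name, jar_uri, input_uri, output_uri,
                           instance_type="m1.large", instance_count=4):
    """Build keyword arguments shaped like boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "word-count",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {
                    "Jar": jar_uri,
                    "Args": [input_uri, output_uri],
                },
            }
        ],
    }


# All S3 locations below are hypothetical placeholders.
request = build_job_flow_request(
    name="example-job-flow",
    jar_uri="s3://example-bucket/jobs/wordcount.jar",
    input_uri="s3://example-bucket/input/",
    output_uri="s3://example-bucket/output/",
)
print(request["Instances"]["InstanceCount"])
```

With credentials configured, a dict like this could be passed as `boto3.client("emr").run_job_flow(**request)`. A real launch also needs settings omitted here, such as IAM roles and a release label.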

WHY USE AMAZON ELASTIC MAPREDUCE?

• Elastic MapReduce removes “MUCK” from Big Data processing

– Hard to manage compute clusters

– Hard to tune Hadoop

– Hard to monitor running Job Flows

– Hard to debug Hadoop jobs

– Hadoop issues prevent smooth operation in the cloud

PROBLEMS CUSTOMERS SOLVE WITH ELASTIC MAPREDUCE

• Targeted advertising / Clickstream analysis

• Data warehousing applications

• Bio-informatics (Genome analysis)

• Financial simulation (Monte Carlo simulation)

• File processing (resize JPEGs)

• Web indexing

• Data mining and BI

• Data or I/O Intensive (m1/m2 instances)

– Data Warehouse

– Data Mining

• Clickstream, logs, events, etc.

• Compute or I/O Intensive (c1, cc1/HPC instances)

– Credit Ratings

– Fraud Models

– Portfolio analysis

– VaR calculation

HARDWARE REQUIREMENTS FOR USE CASES

CLICKSTREAM ANALYSIS – RAZORFISH AND BEST BUY

• Best Buy came to Razorfish

– 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day

Targeted Ad: User recently purchased a home theater system and is searching for video games (1.7 Million per day)

• Leveraged AWS and Elastic MapReduce

– 100-node cluster on demand

– Processing time dropped from 2+ days to 8 hours

– Increased ROAS (Return on Advertising Spend) by 500%

CLICKSTREAM ANALYSIS – ARCHITECTURE

• Invented by Google

• New processing model

• Highly scalable

• Easy to understand

• Industry standard

• Something worth knowing

WHAT IS MAPREDUCE?

• Take input data

• Break into sub-problems

• Distribute to worker nodes

• Worker nodes process sub-problems in parallel

• Take output of worker nodes and reduce to answer

ELASTIC MAPREDUCE MODEL – OVERVIEW
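The five steps of the model can be sketched on a single machine. This is a minimal illustration, not how Elastic MapReduce or Hadoop is implemented: threads stand in for worker nodes, the sub-problem (a sum of squares) is an arbitrary choice, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(data, num_chunks):
    """Break the input into roughly equal sub-problems."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def solve_sub_problem(chunk):
    """Work done by one worker: here, a partial sum of squares."""
    return sum(x * x for x in chunk)


def run(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # Threads stand in for worker nodes; a real cluster uses separate machines.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(solve_sub_problem, chunks))
    # Reduce the workers' outputs to a single answer.
    return reduce(lambda a, b: a + b, partial_results, 0)


total = run(list(range(1, 11)))
print(total)  # 385
```

Because each chunk is solved without looking at any other chunk, the number of workers changes only how the work is divided, not the answer.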

MAPREDUCE EXAMPLE – WORD COUNT

[Diagram: Input → Map Phase (three Mappers) → Sort → Reduce Phase (two Reducers) → Output]

Map output: (“This”, Doc1), (“Word”, Doc1), (“This”, Doc2), (“This”, Doc3), (“Word”, Doc3)

After sort: (“This”, Doc1), (“This”, Doc2), (“This”, Doc3), (“Word”, Doc1), (“Word”, Doc3)

Output: (“This”, 3), (“Word”, 2)
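The word count example can be traced in a few lines of Python. This is a single-process sketch of the map, sort, and reduce phases, not Hadoop code; the document contents are hypothetical, chosen only so that the counts match the slide.

```python
from collections import defaultdict

# Hypothetical document contents, chosen to match the counts on the slide.
documents = {
    "Doc1": "This Word",
    "Doc2": "This",
    "Doc3": "This Word",
}


def map_phase(doc_id, text):
    """Mapper: emit one (word, doc_id) pair per word."""
    return [(word, doc_id) for word in text.split()]


def sort_phase(pairs):
    """Sort by key and group the values for each word."""
    grouped = defaultdict(list)
    for word, doc_id in sorted(pairs):
        grouped[word].append(doc_id)
    return dict(grouped)


def reduce_phase(word, doc_ids):
    """Reducer: collapse one word's values to a count."""
    return word, len(doc_ids)


pairs = [pair for doc_id, text in documents.items()
         for pair in map_phase(doc_id, text)]
grouped = sort_phase(pairs)
counts = dict(reduce_phase(word, ids) for word, ids in grouped.items())
print(counts)  # {'This': 3, 'Word': 2}
```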

ELASTIC MAPREDUCE MODEL – DETAILED

ELASTIC MAPREDUCE IN ACTION – S3 LOG FILE

ELASTIC MAPREDUCE IN ACTION – STEP 1

ELASTIC MAPREDUCE IN ACTION – STEP 2

ELASTIC MAPREDUCE IN ACTION – STEP 3

ELASTIC MAPREDUCE IN ACTION – STEP 4

ELASTIC MAPREDUCE IN ACTION – STEP 5

ELASTIC MAPREDUCE IN ACTION – STEP 6

ELASTIC MAPREDUCE IN ACTION – STEP 7

ELASTIC MAPREDUCE IN ACTION – RESULTS

• Mapper and Reducer in Java JAR files

• Scale as large as needed

– Data

– Processing

– Add nodes (even while running) to speed up

• No need to manage intermediate data

• Suitable for certain types of problems

– Record-oriented input

– No dependencies between records

• No more MUCK – focus on your problem

NOTES / ATTRIBUTES
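The “record-oriented input” and “no dependencies between records” conditions are what let nodes be added freely. The sketch below illustrates this under stated assumptions: the log line format, the sample records, and the function names are all hypothetical, and the partitions are processed in one process rather than on separate nodes.

```python
from collections import Counter

# Hypothetical log records: "<client> <path> <status>", one per line.
LOG_LINES = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "10.0.0.3 /index.html 200",
    "10.0.0.2 /broken 500",
]


def mapper(line):
    """Process one record on its own and emit (status, 1)."""
    _client, _path, status = line.split()
    return status, 1


def reducer(pairs):
    """Sum the counts for each status code."""
    totals = Counter()
    for status, count in pairs:
        totals[status] += count
    return totals


def count_statuses(lines, num_partitions):
    """Split the records across partitions, map each, then merge."""
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    merged = Counter()
    for partition in partitions:
        merged += reducer(mapper(line) for line in partition)
    return merged


one_node = count_statuses(LOG_LINES, 1)
three_nodes = count_statuses(LOG_LINES, 3)
print(one_node == three_nodes, dict(one_node))
```

Since the mapper never looks at more than one line, splitting the same records across one partition or three gives identical totals.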

HADOOP + R

Thank You
