MapReduce and Hadoop
- SALIL NAVGIRE
Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume doubles roughly every 18 months, a Moore's-law-like pace
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20 TB/day of new data (compressed)
• CERN/LHC: 40 TB/day (15 PB/year)
• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How do we assign units of work to the workers?
• What if there are more units of work than workers?
• What if workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers fail?
History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: a distributed, scalable open-source web crawler
• 2004: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: open-source Java implementation of GFS and MapReduce to solve Nutch's scaling problem; later becomes a standalone project
What is MapReduce?
• A programming model for distributing a task across multiple nodes
• Used to develop solutions that process large amounts of data in parallel on clusters of computing nodes
• Introduced in the original MapReduce paper by Google
• Features of MapReduce:
  • Fault tolerance
  • Status and monitoring tools
  • A clean abstraction for programmers
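The abstraction can be sketched in a few lines of Python: the user supplies only a map function and a reduce function, and the framework handles grouping intermediate keys between the two phases. This is a simplified, single-process sketch; real Hadoop distributes these phases across many nodes.

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Single-process sketch of the MapReduce programming model.

    map_fn(record)         -> iterable of (key, value) pairs
    reduce_fn(key, values) -> (key, result)
    """
    # Map phase: apply map_fn to every input record
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # The grouping above stands in for the shuffle/sort step;
    # Reduce phase: aggregate the list of values for each key
    return dict(reduce_fn(key, values)
                for key, values in sorted(intermediate.items()))

# Toy usage (hypothetical data): sum sales per city
records = [("NY", 3), ("SF", 5), ("NY", 4)]
totals = run_mapreduce(lambda r: [r], lambda k, vs: (k, sum(vs)), records)
print(totals)  # {'NY': 7, 'SF': 5}
```

The programmer never touches the grouping logic; that is the "clean abstraction" the slide refers to.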
MapReduce Execution Overview
[Diagram: the user program forks a master and several workers; the master assigns map and reduce tasks; map workers read input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remote-read and sort that data, then write Output File 0 and Output File 1]
Hadoop Components
• HDFS (storage): self-healing, high-bandwidth clustered storage
• MapReduce (processing): fault-tolerant distributed processing
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Runs on commodity hardware
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas are stored on different DataNodes
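A quick back-of-the-envelope sketch of how block splitting and replication multiply storage, using the 64 MB block size mentioned above and the three-way replication described later in the deck:

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks a file is split into, total block
    copies stored across the cluster after replication)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A 200 MB file with 64 MB blocks -> 4 blocks, 12 stored copies
print(hdfs_block_count(200))  # (4, 12)
```

So with the default replication factor, the cluster stores three times the raw data volume; that overhead is the price of the redundancy.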
2 Types of Nodes
Master Nodes
Slave Nodes
Master Node
• NameNode
  • only 1 per cluster
  • metadata server and database
  • SecondaryNameNode helps with some housekeeping
• JobTracker
  • only 1 per cluster
  • job scheduler
Slave Nodes
• DataNodes
  • 1-4000 per cluster
  • block data storage
• TaskTrackers
  • 1-4000 per cluster
  • task execution
NameNode
• A single NameNode stores all metadata and coordinates block replication and read/write access to files
• Tracks filenames, the DataNode locations of each block, owner, group, etc.
• All information is maintained in RAM for fast lookup
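Because all metadata lives in the NameNode's RAM, heap size bounds how many files and blocks the cluster can track. A rough capacity estimate, using the commonly cited rule of thumb of about 150 bytes of NameNode heap per filesystem object (file, directory, or block); the exact figure varies by Hadoop version:

```python
def namenode_objects(heap_bytes, bytes_per_object=150):
    """Approximate number of filesystem objects (files, directories,
    blocks) a NameNode heap can track, assuming the rule-of-thumb
    ~150 bytes of metadata per object."""
    return heap_bytes // bytes_per_object

# A 16 GB heap can track on the order of 100 million objects
print(namenode_objects(16 * 1024**3))
```

This is why large blocks (64-128 MB) matter: fewer blocks per file means less NameNode memory per terabyte stored.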
Secondary NameNode
• Performs memory-intensive administrative functions on behalf of the NameNode (checkpointing: merging the edit log into the filesystem image)
• Should run on a separate machine
• Despite its name, it is not a standby or failover NameNode
DataNode
• DataNodes store file contents as blocks
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
Word Count Example
• Input: text files
• Output: a single file containing (Word <TAB> Count) lines
• Map phase: each mapper emits (Word, Count) pairs
  • e.g. three mappers emit [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
• Reduce phase: for each word, sums the counts
  • [{a,7}, {b,5}, {c,6}]
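The two phases above can be mimicked in plain Python, using the per-mapper outputs from the slide as the intermediate data (a single-process sketch; a real Hadoop job would express the same logic as Mapper and Reducer classes):

```python
from collections import Counter

# Intermediate (word, count) pairs emitted by the three mappers,
# exactly as listed in the slide
mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

# Shuffle + reduce: group by word and sum the counts
totals = Counter()
for pairs in mapper_outputs:
    for word, count in pairs:
        totals[word] += count

# Emit (Word <TAB> Count) lines, matching the output format above
for word in sorted(totals):
    print(f"{word}\t{totals[word]}")
# a	7
# b	5
# c	6
```

The result matches the slide's reduce output: a=7, b=5, c=6.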
Typical Cluster
• 3-4000 commodity servers
• Each server:
  • 2 quad-core CPUs
  • 16-24 GB RAM
  • 4-12 TB disk space
• 20-30 servers per rack
When Should I Use It?
Good choice for jobs that can be broken into parallel sub-tasks:
• Indexing/analysis of log files
• Sorting of large data sets
• Image processing / machine learning
Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL
Who uses Hadoop?