Apache Hadoop The elephant in the room C. Aaron Cois, Ph.D.


Page 1: Hadoop: The elephant in the room

Apache Hadoop

The elephant in the room

C. Aaron Cois, Ph.D.

Page 2: Hadoop: The elephant in the room

Me

@aaroncois

www.codehenge.net

Love to chat!

Page 3: Hadoop: The elephant in the room

The Problem

Page 4: Hadoop: The elephant in the room

Large-Scale Computation

• Traditionally, large computation was focused on:
  – Complex, CPU-intensive calculations
  – Relatively small data sets

• Examples:
  – Solving complex differential equations
  – Calculating digits of Pi

Page 5: Hadoop: The elephant in the room

Parallel Processing

• Distributed systems allow scalable computation (more processors, working simultaneously)

[Diagram: INPUT → parallel processors → OUTPUT]

Page 6: Hadoop: The elephant in the room

Data Storage

• Data is often stored on a SAN
• Data is copied to each compute node at compute time
• This works well for small amounts of data, but requires significant copy time for large data sets

Page 7: Hadoop: The elephant in the room

[Diagram: SAN copying data out to the compute nodes]

Page 8: Hadoop: The elephant in the room

[Diagram: compute nodes calculating on the copied data]

Page 9: Hadoop: The elephant in the room

You must first distribute data each time you run a computation…

Page 10: Hadoop: The elephant in the room

How much data?

Page 11: Hadoop: The elephant in the room

How much data?

over 25 PB of data

Page 12: Hadoop: The elephant in the room

How much data?

over 25 PB of data

over 100 PB of data

Page 13: Hadoop: The elephant in the room

The internet

IDC estimates[2] the internet contains at least:

1 Zettabyte
or 1,000 Exabytes
or 1,000,000 Petabytes

2 http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)

Page 14: Hadoop: The elephant in the room

How much time?

Disk transfer rates:
• Standard 7200 RPM drive: 128.75 MB/s
  => 7.7 secs/GB
  => 13 mins/100 GB
  => > 2 hours/TB
  => 90 days/PB

1 http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
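These figures are straightforward to reproduce; a quick back-of-the-envelope sketch using the 128.75 MB/s sustained rate cited above:

```python
# Back-of-the-envelope transfer times at a given sustained rate.
def transfer_secs(n_bytes, rate_bytes_per_sec):
    return n_bytes / rate_bytes_per_sec

GB = 10**9
TB = 10**12
PB = 10**15

disk = 128.75 * 10**6  # standard 7200 RPM drive, ~128.75 MB/s

print(round(transfer_secs(GB, disk), 1), "s/GB")            # ~7.8 s
print(round(transfer_secs(100 * GB, disk) / 60), "min/100GB")  # ~13 min
print(round(transfer_secs(TB, disk) / 3600, 1), "h/TB")     # ~2.2 h
print(round(transfer_secs(PB, disk) / 86400), "days/PB")    # ~90 days
```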

Page 15: Hadoop: The elephant in the room

How much time?

Fastest network transfer rate:
• iSCSI over 100 Gb Ethernet (theoretical)
  – 12.5 GB/s => 80 sec/TB, 1333 min/PB

OK, ignore the network bottleneck:
• HyperTransport bus
  – 51.2 GB/s => 19 sec/TB, 325 min/PB

1 http://en.wikipedia.org/wiki/List_of_device_bit_rates

Page 16: Hadoop: The elephant in the room

We need a better plan

• Sending data to distributed processors is the bottleneck

• So what if we sent the processors to the data?

Core concept: Pre-distribute and store the data. Assign compute nodes to operate on local data.
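The core concept can be illustrated with a toy locality-aware scheduler: given a map of which node holds a replica of which block, assign each task to a node that already has its input locally. All of the names here are illustrative, not Hadoop's actual scheduler:

```python
# Toy data-locality scheduler: run each task where its data already lives.
block_locations = {            # block id -> nodes holding a replica
    "b1": ["node1", "node2"],
    "b2": ["node2", "node3"],
    "b3": ["node3", "node1"],
}

def assign(block, node_load):
    # Pick the least-loaded node among those that store the block locally,
    # so no data has to move over the network.
    best = min(block_locations[block], key=lambda n: node_load[n])
    node_load[best] += 1
    return best

load = {"node1": 0, "node2": 0, "node3": 0}
plan = {b: assign(b, load) for b in block_locations}
print(plan)   # every block is processed on a node that already stores it
```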

Page 17: Hadoop: The elephant in the room

The Solution

Page 18: Hadoop: The elephant in the room

Distributed Data Servers

Page 19: Hadoop: The elephant in the room

Distribute the Data

[Diagram: data blocks distributed across the data servers]

Page 20: Hadoop: The elephant in the room

Send computation code to servers containing relevant data

[Diagram: computation code dispatched to the servers holding each data block]

Page 21: Hadoop: The elephant in the room

Hadoop Origin

• Hadoop was modeled after innovative systems created by Google

• Designed to handle massive (web-scale) amounts of data

Fun fact: Hadoop’s creator, Doug Cutting, named it after his son’s stuffed elephant

Page 22: Hadoop: The elephant in the room

Hadoop Goals

• Store massive data sets
• Enable distributed computation

• Heavy focus on:
  – Fault tolerance
  – Data integrity
  – Commodity hardware

Page 23: Hadoop: The elephant in the room

Hadoop System

Google     →  Hadoop
GFS        →  HDFS
MapReduce  →  Hadoop MapReduce
BigTable   →  HBase


Page 25: Hadoop: The elephant in the room

Components

Page 26: Hadoop: The elephant in the room

HDFS

• “Hadoop Distributed File System”
• Sits on top of the native filesystem (ext3, etc.)
• Stores data in files, replicated and distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100 MB+ files

Page 27: Hadoop: The elephant in the room

HDFS

Files are split into blocks for storage

Datanodes
– Data blocks are distributed/replicated across datanodes

Namenode
– The master node
– Keeps track of the location of data blocks

Page 28: Hadoop: The elephant in the room

HDFS

Multi-Node Cluster

[Diagram: NameNode on the master; DataNodes on the slaves]

Page 29: Hadoop: The elephant in the room

MapReduce

A programming model
– Designed to make programming parallel computation over large distributed data sets easy
– Each node processes data already residing on it (when possible)
– Inspired by the functional programming map and reduce functions

Page 30: Hadoop: The elephant in the room

MapReduce

JobTracker
– Runs on a master node
– Clients submit jobs to the JobTracker
– Assigns Map and Reduce tasks to slave nodes

TaskTracker
– Runs on every slave node
– Daemon that instantiates Map or Reduce tasks and reports results to the JobTracker

Page 31: Hadoop: The elephant in the room

MapReduce

Multi-Node Cluster

[Diagram: JobTracker on the master; TaskTrackers on the slaves]

Page 32: Hadoop: The elephant in the room

Multi-Node Cluster

[Diagram: MapReduce layer (JobTracker on the master, TaskTrackers on the slaves) stacked on the HDFS layer (NameNode on the master, DataNodes on the slaves)]

Page 33: Hadoop: The elephant in the room

HBase

• Hadoop’s database
• Sits on top of HDFS
• Provides random read/write access to Very Large™ tables
  – Billions of rows, billions of columns
• Access via Java, Jython, Groovy, Scala, or a REST web service

Page 34: Hadoop: The elephant in the room

A Typical Hadoop Cluster

• Consists entirely of commodity ~$5k servers
• 1 master, 1 to 1000+ slaves
• Scales linearly as more processing nodes are added

Page 35: Hadoop: The elephant in the room

How it works

Page 36: Hadoop: The elephant in the room

http://en.wikipedia.org/wiki/MapReduce

Traditional MapReduce

Page 37: Hadoop: The elephant in the room

Hadoop MapReduce

Image Credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854

Page 38: Hadoop: The elephant in the room

MapReduce Example

function map(Str name, Str document):
  for each word w in document:
    increment_count(w, 1)

function reduce(Str word, Iter partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  return (word, sum)
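The word-count pseudocode translates directly to plain Python. A minimal single-process sketch of the same map/reduce flow follows; in a real cluster, Hadoop would run the map and reduce functions on many nodes and perform the shuffle (grouping by key) between them:

```python
from collections import defaultdict

def map_fn(name, document):
    # Emit (word, 1) for every word, like map() above.
    for w in document.split():
        yield (w, 1)

def reduce_fn(word, partial_counts):
    # Sum the partial counts for one word, like reduce() above.
    return (word, sum(int(pc) for pc in partial_counts))

def run(documents):
    # The "shuffle" step: group all emitted counts by word.
    groups = defaultdict(list)
    for name, doc in documents.items():
        for word, count in map_fn(name, doc):
            groups[word].append(count)
    return dict(reduce_fn(w, pcs) for w, pcs in groups.items())

print(run({"d1": "the quick fox", "d2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```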

Page 39: Hadoop: The elephant in the room

What didn’t I worry about?

• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity

Page 40: Hadoop: The elephant in the room

Demo

Page 41: Hadoop: The elephant in the room

Try the demo yourself!

Go to:

https://github.com/cacois/vagrant-hadoop-cluster

Follow the instructions in the README