Hadoop 101

Introducing: The Modern Data Operating System


DESCRIPTION

An introduction to big data and Hadoop: defining big data, the history of Hadoop, how Hadoop works, and HDFS.


Page 1: Hadoop 101

Introducing: The Modern Data Operating System

Page 2: Hadoop 101

Hadoop is ... a scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license).

Core Hadoop has two main systems:

● Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage

● MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data-programming abstraction

Page 3: Hadoop 101

Hadoop Origins

Each Google technology maps to a Hadoop counterpart:

GFS → HDFS
Map/Reduce → MapReduce
BigTable → HBase

Page 4: Hadoop 101

Hadoop Chronicles

[Timeline slide: the Google papers (GFS, Map/Reduce, BigTable) and Doug Cutting, who brought their ideas to open source.]

Page 5: Hadoop 101

Etymology

● Hadoop was created in 2004 by Douglass (Doug) Cutting

● It implemented Google's File System and MapReduce papers

● He aimed to index the internet, Google-style, for the fledgling search engine Nutch

● He named it after his son's favourite toy, an elephant-shaped stuffed animal called Hadoop

Page 6: Hadoop 101

What is Big Data?

"In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools."

— Wikipedia

Pages 7–11: Hadoop 101 (image-only slides)
Page 12: Hadoop 101

How big is big?

● 2008: Google processes 20 PB a day

● 2009: eBay has 6.5 PB of user data, plus 50 TB more a day

● 2011: Yahoo! has 180–200 PB of data

● 2012: Facebook ingests 500 TB of data a day

Page 13: Hadoop 101

Limitations of Existing Analytics Architecture

The traditional stack, top to bottom:

BI Reports + Online Apps
RDBMS (aggregated data)
ETL (Extract, Transform & Load)
Storage Grid
Data Collection
Instrumentation (raw data sources)

Its limitations:

- Moving data from storage to compute doesn't scale
- Can't explore the original raw data
- Archiving = premature death
- Mostly append

Page 14: Hadoop 101

Why Hadoop? Challenge: read 1 TB of data.

1 machine: 4 I/O channels, each channel 100 MB/s → 45 minutes

10 machines: 4 I/O channels each, 100 MB/s per channel → 4.5 minutes
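The arithmetic behind those figures: 1 TB ÷ (4 channels × 100 MB/s) = 1,000,000 MB ÷ 400 MB/s = 2,500 s ≈ 42 minutes, which the slide rounds to 45. Ten machines read in parallel, so each scans only 100 GB and the wall-clock time drops tenfold, to about 4.5 minutes.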

Page 15: Hadoop 101

Hadoop and Friends

Page 16: Hadoop 101

The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):

- The schema must be created before any data can be loaded
- An explicit load operation transforms the data into the database's internal structure
- New columns must be added explicitly before data for them can be loaded
- Reads are fast; standards and governance come built in

Schema-on-Read (Hadoop):

- Data is simply copied to the file store; no transformation is needed
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it
- Loads are fast; flexibility and agility come built in
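Concretely, "data is simply copied" means landing the raw file in HDFS as-is; a minimal sketch, with a hypothetical path and file name:

bin/hadoop fs -mkdir /kenshoo/raw
bin/hadoop fs -copyFromLocal clicks.csv /kenshoo/raw/clicks.csv

No schema exists at this point; a SerDe interprets the bytes only when the data is read.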

Page 17: Hadoop 101

Hadoop Components: Master/Slave Architecture

Masters: Name Node (HDFS), Job Tracker (MapReduce)

Slaves: Data Nodes (HDFS), Task Trackers (MapReduce)

Page 18: Hadoop 101

File metadata on the NameNode:

/kenshoo/data1.txt → blocks 1, 2, 3
/kenshoo/data2.txt → blocks 4, 5

The replication factor (r=3 here) is set via the dfs.replication property in hdfs-site.xml.

[Diagram: blocks 1–5 spread across the Data Nodes, each block stored on three different nodes.]
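A minimal sketch of the relevant hdfs-site.xml property (3 also happens to be Hadoop's stock default):

<property>
  <!-- default block replication factor for new files -->
  <name>dfs.replication</name>
  <value>3</value>
</property>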

Page 19: Hadoop 101

Underlying FS options

ext3- released in 2001 - Used by Yahoo!- bootstrap + format slow- set:

- noatime- tune2fs (to turn off reserved blocks)

ext4- released in 2008 - Used by Google- Fast as XFS- set:

- delayed allocation off-noatime- tune2fs (to turn off reserved blocks)

XFS- released in 1993 - Fast- Drawbacks:

- deleting large # of files
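In shell terms, the ext3/ext4 tunings above amount to something like this (the device and mount point are illustrative):

tune2fs -m 0 /dev/sdb1              # reclaim the root-reserved blocks for data
mount -o noatime /dev/sdb1 /data/1  # skip access-time writes on every read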

Page 20: Hadoop 101

Sample HDFS Shell Commands

bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir

Mounting using FUSE:

hadoop-fuse-dfs dfs://10.73.9.50 /hdfs
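A short illustrative session tying some of these together (paths and file names are hypothetical):

bin/hadoop fs -mkdir /kenshoo                      # create an HDFS directory
bin/hadoop fs -copyFromLocal data1.txt /kenshoo/   # upload a local file
bin/hadoop fs -ls /kenshoo                         # list it
bin/hadoop fs -tail /kenshoo/data1.txt             # print the file's last kilobyte
bin/hadoop fs -setrep -w 4 /kenshoo/data1.txt      # raise replication to 4 and wait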

Page 21: Hadoop 101

Network Topology

[Diagram: the Name Node, Job Tracker and HBase Master above three racks of Data Nodes, connected through the cluster switches.]

Yahoo! installation:

- 8 core switches
- 100 racks
- 40 servers/rack
- 1 Gbit within a rack
- 10 Gbit among racks
- Total: 11 PB

Page 22: Hadoop 101

Rack Awareness

NameNode metadata:

file.txt = Blk A: DN 2, 7, 8
           Blk B: DN 9, 12, 14

[Diagram: Data Nodes 2–5 in Rack 1, 7–10 in Rack 2 and 12–15 in Rack 3, with the Name Node, Job Tracker and HBase Master above them. Block A's replicas sit on nodes 2, 7 and 8; block B's on nodes 9, 12 and 14. Each block's replicas span two racks, so losing a whole rack never loses a block.]
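fsck can report this placement directly, including the rack of every replica (the path is illustrative):

bin/hadoop fsck /file.txt -files -blocks -racks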

Page 23: Hadoop 101

HDFS Writes

NameNode metadata:

file.txt = Blk A: DN 2, 7, 9

[Diagram: the client asks the NameNode where to put block A, then streams it to Data Node 2 in Rack 1; that node pipelines the replica across the core switch to node 7 in Rack 2, which forwards it to node 9 in the same rack.]
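To watch the outcome of such a write, copy a file in and then list where its blocks landed (path illustrative; -locations prints the Data Nodes per block):

bin/hadoop fs -copyFromLocal file.txt /file.txt
bin/hadoop fsck /file.txt -files -blocks -locations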

Page 24: Hadoop 101

Reading Files

The client asks the NameNode: "I want to read file1.txt." The NameNode answers from its metadata with the file's parts:

file1.txt = Blk A: DN 2, 7, 8
            Blk B: DN 9, 12, 14

The client then fetches each block directly from the closest Data Node holding a replica.

[Diagram: Data Nodes 2–5 in Rack 1, 7–10 in Rack 2 and 12–15 in Rack 3; the client pulls block A from one of nodes 2, 7, 8 and block B from one of 9, 12, 14, routed through the core switch.]
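The same read from the shell, which streams each block from the closest replica (path is illustrative):

bin/hadoop fs -cat /file1.txt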