Hadoop Distributed File System (HDFS): Behind the Scenes

This presentation walks you through the core concepts of HDFS architecture: where HDFS fits in Hadoop, what the HDFS architecture is, and what its role is...

WHAT IS

WHERE HDFS FITS IN HADOOP?

LET’S FIRST UNDERSTAND BUZZWORDS IN THE HADOOP WORLD

REPLICATION

FAULT TOLERANCE

LOAD BALANCING

RELIABILITY

CLUSTERING

IT’S TIME FOR A DEEP DIVE…

HDFS ARCHITECTURE

Name Node

Data Node

Task Tracker

Job Tracker

Image and Journal

HDFS Client

Checkpoint Node

Backup Node

[Architecture diagram: Name Node, Job Tracker, Checkpoint, Backup Node, Image, Journal, HDFS Client, and Data Nodes 1 to N, each paired with a Task Tracker]

NAME NODE

[Diagram: Name Node with Job Tracker, Inode, Image, Checkpoint, Journal]

Inode - Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.

Image - The inode data and the list of blocks belonging to each file.

Checkpoint - The persistent record of the image stored in the local host’s native file system

Journal - Write-ahead commit log for changes to the file system that must be persistent.
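The relationship between the four terms above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the class `ToyNameNode`, the `Inode` fields, and the JSON encoding are all assumptions made for illustration. It shows a change being written to the journal before it is applied to the in-memory image, and the image being rebuilt from a checkpoint plus journal replay.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Inode:
    """Per-file metadata; field names are illustrative, not HDFS's own."""
    path: str
    permissions: str = "rw-r--r--"
    mtime: float = 0.0
    blocks: list = field(default_factory=list)  # block IDs belonging to the file


class ToyNameNode:
    """Hypothetical stand-in for the NameNode's metadata handling."""

    def __init__(self):
        self.image = {}    # in-memory image: path -> Inode
        self.journal = []  # write-ahead log of namespace changes

    def _apply(self, op):
        if op["type"] == "create":
            self.image[op["path"]] = Inode(path=op["path"], mtime=op["mtime"])
        elif op["type"] == "add_block":
            self.image[op["path"]].blocks.append(op["block_id"])

    def log_and_apply(self, op):
        # Write-ahead: record the change in the journal before applying it.
        self.journal.append(json.dumps(op))
        self._apply(op)

    def checkpoint(self):
        """Serialize the image as a persistent record and empty the journal."""
        snapshot = json.dumps({p: asdict(i) for p, i in self.image.items()})
        self.journal.clear()
        return snapshot

    @classmethod
    def recover(cls, snapshot, journal):
        """Rebuild the image from a checkpoint, then replay the journal."""
        nn = cls()
        for path, fields in json.loads(snapshot).items():
            nn.image[path] = Inode(**fields)
        for entry in journal:
            nn._apply(json.loads(entry))
        return nn


nn = ToyNameNode()
nn.log_and_apply({"type": "create", "path": "/a.txt", "mtime": 1.0})
snap = nn.checkpoint()
nn.log_and_apply({"type": "add_block", "path": "/a.txt", "block_id": 42})
restored = ToyNameNode.recover(snap, list(nn.journal))
print(restored.image["/a.txt"].blocks)  # [42]
```

The block added after the checkpoint survives recovery only because it was journaled, which is the reason the journal has to be persistent.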

DATA NODE

On Start Up…

[Diagram: Data Node, Name Node]

DATA NODE

[Diagram: Data Node reporting to Name Node]

Total storage capacity

Fraction of storage in use

Number of data transfers in progress

Commands (sent back by the Name Node)
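The report-and-reply exchange above can be sketched as a heartbeat loop. The following is a minimal model under stated assumptions: the names `Heartbeat`, `ToyNameNode`, `queue_command`, and the 600-second timeout are hypothetical and chosen for illustration only. It shows the Data Node's three reported values and the Name Node piggybacking queued commands on its reply.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    node_id: str
    capacity_bytes: int         # total storage capacity
    used_fraction: float        # fraction of storage in use
    transfers_in_progress: int  # number of data transfers in progress


class ToyNameNode:
    """Hypothetical model of heartbeat bookkeeping, not HDFS code."""

    def __init__(self, timeout_s=600.0):
        self.timeout_s = timeout_s  # assumed value for illustration
        self.last_seen = {}         # node_id -> time of last heartbeat
        self.stats = {}             # node_id -> latest Heartbeat
        self.pending = {}           # node_id -> commands waiting to be sent

    def queue_command(self, node_id, command):
        self.pending.setdefault(node_id, []).append(command)

    def on_heartbeat(self, hb, now):
        """Record the report and return queued commands as the reply."""
        self.last_seen[hb.node_id] = now
        self.stats[hb.node_id] = hb
        return self.pending.pop(hb.node_id, [])

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


nn = ToyNameNode()
nn.queue_command("dn1", "replicate block 7 to dn2")
reply = nn.on_heartbeat(Heartbeat("dn1", 10**12, 0.4, 2), now=0.0)
print(reply)                      # ['replicate block 7 to dn2']
print(nn.dead_nodes(now=601.0))   # ['dn1']
```

In this model the Name Node never contacts a Data Node directly; it only answers heartbeats, so a node that stops reporting is simply detected by the timeout.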

HDFS CLIENT

IMAGE & JOURNAL

Flush & Sync Operation

CHECKPOINT NODE

BACKUP NODE

FILE I/O OPERATIONS

Single Writer

Multiple Reader

DATA WRITE OPERATION

[Diagram: Client, Name Node, and Data Nodes DN1 to DN4; write pipeline client, DN1, DN2, DN3 with stages setup, packet1 through packet5, close]
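The three stages in the diagram (setup, packet streaming, close) can be sketched as a chain of nodes that each store a packet and forward it downstream. This is a simplified model: `ToyDataNode`, `write_block`, and the 4-byte packet size are hypothetical, and the sketch sends packets one at a time, whereas a real pipeline may keep several packets in flight before acknowledgments arrive.

```python
class ToyDataNode:
    """Hypothetical pipeline stage; not the HDFS DataNode implementation."""

    def __init__(self, name):
        self.name = name
        self.downstream = None
        self.buffer = []       # packets received for the open block
        self.finalized = False

    def setup(self, rest_of_pipeline):
        # Setup stage: link this node to the next one in the pipeline.
        if rest_of_pipeline:
            self.downstream = rest_of_pipeline[0]
            self.downstream.setup(rest_of_pipeline[1:])

    def receive(self, seq, payload):
        # Store the packet, forward it, and return the ack only after
        # every downstream node has acknowledged it.
        self.buffer.append((seq, payload))
        if self.downstream is not None:
            return self.downstream.receive(seq, payload)
        return seq

    def close(self):
        # Close stage: finalize the replica on every node in the chain.
        self.finalized = True
        if self.downstream is not None:
            self.downstream.close()


def write_block(data, pipeline, packet_size=4):
    """Stream data through the pipeline and return the acked sequence numbers."""
    head = pipeline[0]
    head.setup(pipeline[1:])
    acks = []
    for seq, start in enumerate(range(0, len(data), packet_size), start=1):
        acks.append(head.receive(seq, data[start:start + packet_size]))
    head.close()
    return acks


nodes = [ToyDataNode(n) for n in ("DN1", "DN2", "DN3")]
payload = b"abcdefghijklmnopqrst"  # 20 bytes -> 5 packets of 4 bytes
acks = write_block(payload, nodes)
print(acks)  # [1, 2, 3, 4, 5]
```

The client only talks to the first node; replication to the others is a side effect of forwarding, which keeps the client's outbound bandwidth independent of the replica count.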

DATA WRITE/READ OPERATION

[Diagram: Client, Name Node, DN1]

Single Writer Multiple Reader Model

Lease Management (Soft Limit and Hard Limit)

Pipelining, buffering, and hflush

Checksum for data integrity

Choosing nodes for read operation
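The single-writer model and the two lease limits listed above can be sketched as follows. The class `LeaseManager`, its method names, and the limit values (60 seconds soft, one hour hard) are assumptions for illustration. The idea shown: a writer holds an exclusive lease while it keeps renewing; once the soft limit passes without renewal another client may take over the lease, and once the hard limit passes the lease is released on the writer's behalf.

```python
class LeaseManager:
    """Hypothetical sketch of write-lease bookkeeping, not HDFS code."""

    def __init__(self, soft_limit_s=60.0, hard_limit_s=3600.0):
        # Both limits are assumed values for illustration.
        self.soft = soft_limit_s
        self.hard = hard_limit_s
        self.leases = {}  # path -> (holder, time of last renewal)

    def acquire(self, path, client, now):
        """Grant the lease if the file is free or the soft limit has expired."""
        lease = self.leases.get(path)
        if lease is not None:
            holder, renewed = lease
            if holder != client and now - renewed <= self.soft:
                return False  # another writer still holds a fresh lease
        self.leases[path] = (client, now)
        return True

    def renew(self, path, client, now):
        lease = self.leases.get(path)
        if lease is None or lease[0] != client:
            return False
        self.leases[path] = (client, now)
        return True

    def expire_hard(self, now):
        """Release leases that were not renewed within the hard limit."""
        expired = sorted(p for p, (_, t) in self.leases.items()
                         if now - t > self.hard)
        for p in expired:
            del self.leases[p]
        return expired


lm = LeaseManager()
first = lm.acquire("/f", "A", now=0.0)    # A becomes the single writer
blocked = lm.acquire("/f", "B", now=30.0)  # within soft limit: refused
taken = lm.acquire("/f", "B", now=61.0)    # soft limit passed: B preempts
released = lm.expire_hard(now=61.0 + 3601.0)
print(first, blocked, taken, released)  # True False True ['/f']
```

Readers are deliberately absent from this sketch: in a single-writer, multiple-reader model the lease restricts writers only.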

BLOCK PLACEMENT

[Diagram: cluster rooted at / with RACK1 (DN1 to DN5), RACK2 (DN6 to DN10), and RACK3 (DN11 to DN15). The Client sends Add(data) to the Name Node (Journal, Inode, Image, checkpoint), which returns the Data Nodes for the replicas]
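The slide does not spell out the placement rule, so the sketch below assumes the commonly described default for three replicas: the first replica on the writer's own node, and the second and third on two different nodes of one other rack. The function name `choose_replica_nodes` and the dictionary layout are hypothetical, the writer is assumed to run on a Data Node, and fallback cases (too few racks or nodes) are ignored.

```python
import random


def choose_replica_nodes(topology, writer, rng=None):
    """Pick three target nodes for a block.

    topology: rack name -> list of node names (illustrative layout).
    writer:   node the client runs on; assumed to be a Data Node.
    """
    rng = rng or random.Random()
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # Candidate remote racks need at least two nodes for replicas 2 and 3.
    remote_racks = [r for r in topology
                    if r != local_rack and len(topology[r]) >= 2]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Topology taken from the diagram: three racks of five nodes each.
topology = {
    "RACK1": ["DN1", "DN2", "DN3", "DN4", "DN5"],
    "RACK2": ["DN6", "DN7", "DN8", "DN9", "DN10"],
    "RACK3": ["DN11", "DN12", "DN13", "DN14", "DN15"],
}
targets = choose_replica_nodes(topology, writer="DN2", rng=random.Random(0))
print(targets)
```

Under this rule a block survives the loss of a whole rack while only one of its three copies crosses a rack boundary during the write.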

REPLICATION MANAGEMENT

[Diagram: the same three-rack cluster (RACK1: DN1 to DN5, RACK2: DN6 to DN10, RACK3: DN11 to DN15) with the Name Node (Journal, Inode, Image, checkpoint) tracking over-replicated and under-replicated blocks]
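The over-replicated and under-replicated states in the diagram can be illustrated with a small classifier. The function `classify_blocks`, the block IDs, and the replication factor of 3 are assumptions for this sketch. It compares each block's live replica count with the target and orders the under-replicated blocks so that those with the fewest replicas are handled first.

```python
def classify_blocks(block_locations, replication_factor=3):
    """Split blocks into under- and over-replicated sets.

    block_locations: block ID -> set of Data Nodes holding a replica
    (an illustrative stand-in for the Name Node's block map).
    """
    under, over = {}, {}
    for block_id, nodes in block_locations.items():
        diff = replication_factor - len(nodes)
        if diff > 0:
            under[block_id] = diff   # replicas still to create
        elif diff < 0:
            over[block_id] = -diff   # replicas to remove
    # Blocks with the fewest live replicas are the most urgent to repair.
    priority = sorted(under, key=lambda b: len(block_locations[b]))
    return under, over, priority


locations = {
    "blk_1": {"DN1", "DN6", "DN7"},
    "blk_2": {"DN2"},
    "blk_3": {"DN3", "DN8", "DN9", "DN11"},
    "blk_4": {"DN4", "DN12"},
}
under, over, priority = classify_blocks(locations)
print(under)     # {'blk_2': 2, 'blk_4': 1}
print(over)      # {'blk_3': 1}
print(priority)  # ['blk_2', 'blk_4']
```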

BALANCER

Balancing the disk space utilization on individual data nodes.

Based on utilization threshold. Utilization balancing follows block placement policy.
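One way to read "based on utilization threshold" is that a node counts as balanced when its own utilization is within a threshold of the cluster-wide utilization. The sketch below assumes that interpretation; the function `find_imbalanced`, the 10 percent default, and the sample numbers are hypothetical. It only identifies source and destination candidates and does not move blocks or check placement policy.

```python
def find_imbalanced(nodes, threshold=0.10):
    """Return (cluster utilization, over-utilized nodes, under-utilized nodes).

    nodes: name -> (used_bytes, capacity_bytes); threshold is an assumed
    fraction, e.g. 0.10 means within 10 percentage points of the cluster.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_util = total_used / total_capacity
    over = sorted(n for n, (u, c) in nodes.items()
                  if u / c > cluster_util + threshold)
    under = sorted(n for n, (u, c) in nodes.items()
                   if u / c < cluster_util - threshold)
    return cluster_util, over, under


sample = {"DN1": (90, 100), "DN2": (50, 100), "DN3": (10, 100)}
cluster_util, sources, destinations = find_imbalanced(sample)
print(cluster_util)   # 0.5
print(sources)        # ['DN1']
print(destinations)   # ['DN3']
```

A real balancer would then move blocks from sources to destinations only where the move keeps replicas spread according to the block placement policy.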

SCANNER

Scanner verifies the data integrity based on checksum.
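Checksum-based verification can be sketched with CRC32 from Python's standard `zlib` module. The 512-byte chunk size and the function names `compute_checksums` and `scan_block` are assumptions for illustration. Checksums are computed per chunk when the block is written and recomputed during a scan, so a mismatch pinpoints the damaged chunk.

```python
import zlib

CHUNK = 512  # bytes covered by one checksum; assumed value for illustration


def compute_checksums(data, chunk=CHUNK):
    """CRC32 of each fixed-size chunk of the block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def scan_block(data, stored, chunk=CHUNK):
    """Return indexes of chunks whose checksum no longer matches."""
    current = compute_checksums(data, chunk)
    if len(current) != len(stored):
        # Length changed: treat every chunk as suspect.
        return list(range(max(len(current), len(stored))))
    return [i for i, (a, b) in enumerate(zip(current, stored)) if a != b]


block = bytes(range(256)) * 8          # 2048 bytes -> 4 chunks
stored = compute_checksums(block)      # saved alongside the block at write time

corrupted = bytearray(block)
corrupted[600] ^= 0xFF                 # flip one byte inside chunk 1
print(scan_block(block, stored))             # []
print(scan_block(bytes(corrupted), stored))  # [1]
```

A replica that fails the scan would be reported so that a healthy copy can be re-replicated from another node.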
