HADOOP DISTRIBUTED FILE SYSTEM HDFS Reliability Based on “The Hadoop Distributed File System” K. Shvachko et al., MSST 2010 Michael Tsitrin 26/05/13


Page 1

HADOOP DISTRIBUTED FILE SYSTEM
HDFS Reliability

Based on "The Hadoop Distributed File System", K. Shvachko et al., MSST 2010

Michael Tsitrin
26/05/13

Page 2

Topics
• Introduction
• HDFS Overview
  • Basics
  • Architecture
• Data reliability
  • Block replicas
• NameNode reliability
  • NameNode failure
  • Journal
  • Checkpoint
• Conclusion

Page 3

INTRODUCTION

Page 4

Introduction
• HDFS is a cloud-based file system that allows storage of large data sets on clusters of commodity hardware
• With a huge number of components, each having a non-trivial probability of failure, hardware failure is the norm rather than the exception
• The purpose of this presentation is to present the techniques HDFS uses to keep the system and its data reliable

Page 5

HDFS OVERVIEW

Page 6

HDFS Basics
• An open-source implementation of a distributed file system, based on the Google File System
• Designed to store very large data sets reliably across large clusters of computers
• Optimized for MapReduce applications:
  • Large files, some several GB in size
  • Reads are performed in a large streaming fashion
  • High throughput rather than low latency

Page 7

HDFS Architecture

[Architecture diagram: a single NameNode holds the metadata (file name, replica count, block list, e.g. /home/foo/data) and serves clients' metadata ops; clients write and read blocks directly to and from DataNodes spread across racks (Rack 1, Rack 2); the NameNode sends block ops, such as replication, to the DataNodes.]

Page 8

HDFS NameNode
• The HDFS NameNode keeps the metadata for each data block in the system
• It is implemented as a single master server for the cluster
• To achieve high performance, the entire namespace is kept in RAM
• It manages the replication logic for the DataNodes
• It serves clients with file block locations for reads
• The metadata includes:
  • The files and directories hierarchy
  • Permissions, modification times, etc.
  • The mapping of file blocks to DataNodes
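As a rough illustration of how this metadata might fit together, here is a minimal Python sketch; the structure names and the `locate` helper are hypothetical, not Hadoop's actual classes. The namespace maps paths to block IDs, while block-to-DataNode locations are kept in a separate map, since those are rebuilt from block reports rather than persisted.

```python
# Hypothetical in-RAM metadata layout (illustrative names, not real
# Hadoop internals).  The namespace maps each file path to its ordered
# block IDs; a separate map records which DataNodes hold each block.
namespace = {
    "/home/foo/data": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn4", "dn5"},
}

def locate(path):
    """Serve a client read: list the DataNodes for each block of a file."""
    return [(blk, sorted(block_locations[blk])) for blk in namespace[path]]
```

Keeping everything in plain in-memory maps is what makes lookups fast, and it is also why the NameNode needs the persistence machinery described later in the deck.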

Page 9

HDFS DataNode
• A cluster can contain thousands of DataNodes
• The DataNode is where the actual file blocks are kept
• User data is divided into blocks and replicated across DataNodes
• A DataNode identifies the block replicas in its possession to the NameNode by sending a block report
• DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode

Page 10

DATA RELIABILITY
Block replicas

Page 11

NameNode & Data Replication
• All data-replication information is stored and managed by the NameNode
• The NameNode makes all decisions regarding the replication of blocks
• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster
  • A Blockreport contains a list of all blocks on a DataNode
  • Receipt of a Heartbeat implies that the DataNode is functioning properly
  • DataNodes without a recent Heartbeat are marked as dead
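The liveness rule in the last bullet can be sketched in a few lines of Python; the timeout value and function name here are assumptions for illustration (the real HDFS timeout is configurable), not Hadoop's implementation.

```python
HEARTBEAT_TIMEOUT = 600.0  # seconds; an assumed value, not HDFS's default

def dead_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat maps each DataNode name to the timestamp of its most
    recent heartbeat; `now` is the current time in the same units.
    """
    return {dn for dn, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT}
```

Any node this check flags becomes an input to the re-replication logic on the next slide.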

Page 12

Re-replication
• The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary
• The need for re-replication may arise for many reasons:
  • A DataNode may become unavailable
  • The replication factor of a file may be increased
  • A replica may become corrupted
  • A hard disk on a DataNode may fail
• Re-replication is fast because it is a parallel problem that scales with the size of the cluster
  • This lowers the probability of block loss while replication is carried out
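The detection step behind all of these cases can be sketched as a scan for blocks whose live replica count has dropped below the target. This is a simplification under assumed names; real HDFS maintains priority queues of under-replicated blocks rather than rescanning.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return the blocks with fewer than `target` replicas on live nodes.

    block_locations maps block ID -> set of DataNodes holding a replica;
    live_nodes is the set of DataNodes currently considered alive.
    """
    return {blk for blk, nodes in block_locations.items()
            if len(nodes & live_nodes) < target}
```

Every block this returns becomes a candidate for copying to a fresh DataNode, which is why losing a node triggers work proportional only to that node's blocks.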

Page 13

Replica placement
• To protect against rack failure (e.g. a power shortage), the NameNode can manage replicas so that they are stored in different racks
• Besides data reliability, this can also improve network bandwidth and clients' latency
• Common case (replication factor == 3):
  • Put one replica on one node in the local rack
  • Another on a different node in the local rack
  • The last on a different node in a different rack
• This doesn't compromise data reliability and availability
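The common-case policy above can be sketched as follows. This is a simplification of the policy as stated on this slide, not Hadoop's actual BlockPlacementPolicyDefault, and the rack/node names are illustrative.

```python
import random

def place_replicas(local_rack, racks, rng=random):
    """Pick three targets per the slide's policy: two distinct nodes in
    the writer's local rack, plus one node in a different rack.

    racks maps rack name -> list of node names in that rack.
    """
    first, second = rng.sample(racks[local_rack], 2)   # two local-rack nodes
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    third = rng.choice(racks[remote_rack])             # one off-rack node
    return [first, second, third]
```

Because only one replica leaves the local rack, write traffic crosses the inter-rack link once, yet a whole-rack failure still leaves a surviving copy.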

Page 14

NAMENODE RELIABILITY

Page 15

NameNode Failure
• The NameNode is a single point of failure for the HDFS cluster
  • If it becomes unavailable to clients, the whole cluster is unusable
  • Corruption or loss of the metadata makes the data blocks unavailable
• The NameNode keeps its data in RAM, risking full metadata loss in case of a power shortage
  • A persistent solution is needed

Page 16

NameNode Persistence
• The persistent record of the image stored in the local host's native file system is called a checkpoint
• The NameNode also stores the modification log of the image, called the journal, in the local host's native file system
• For improved durability, redundant copies of the checkpoint and journal can be made at other servers

Page 17

Journal
• The journal persistently records every change that occurs to the file system metadata (not including the block mapping)
• It is implemented as a write-ahead commit log for changes to the file system that must be persistent
  • To avoid becoming a bottleneck, several transactions are batched and committed together
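A toy sketch of such batched commits follows; the class, method names, and batch size are illustrative, not Hadoop's actual journal implementation.

```python
class Journal:
    """Write-ahead log that defers the expensive flush until a batch of
    transactions has accumulated, so one sync covers many changes."""

    def __init__(self, batch_size=5):
        self.batch_size = batch_size
        self.pending = []   # logged but not yet durable
        self.durable = []   # stands in for bytes synced to stable storage

    def log(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # One (simulated) fsync commits the whole batch at once.
        self.durable.extend(self.pending)
        self.pending.clear()
```

The trade-off is the usual one for group commit: each transaction may wait slightly longer to become durable, but the per-transaction sync cost is amortized across the batch.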

Page 18

Checkpoint
• A checkpoint is a persistent record of the NameNode's state written to disk
• The checkpoint file is never changed by the NameNode
  • Either a new checkpoint is created, or a namespace is loaded from a previous checkpoint by the NameNode
• When the NameNode starts, it performs the checkpoint process:
  • Reads the current checkpoint and journal from disk
  • Applies all the transactions from the journal to the in-memory representation of the namespace
  • Flushes out this new version into a new checkpoint on disk
  • Truncates the old journal
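Those four startup steps can be sketched end to end. Here transactions are simplified to (op, path) pairs and the namespace to a plain dict; the real on-disk formats differ.

```python
def startup_checkpoint(checkpoint, journal):
    """Replay the journal over the checkpoint image, then return a fresh
    checkpoint together with an empty (truncated) journal."""
    namespace = dict(checkpoint)       # 1. read the current checkpoint
    for op, path in journal:           # 2. apply every journal transaction
        if op == "create":
            namespace[path] = []       #    new file with no blocks yet
        elif op == "delete":
            namespace.pop(path, None)
    new_checkpoint = dict(namespace)   # 3. flush the new version to disk
    return new_checkpoint, []          # 4. the old journal is truncated
```

Because the journal is only replayed, never rewritten, a crash during this process leaves the old checkpoint and journal intact to retry from.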

Page 19

Creating a Checkpoint
• A new checkpoint file can be created at startup only, or periodically
• Creating a checkpoint empties the journal, which matters because:
  • A long journal increases the probability of loss or corruption of the journal file
  • A very large journal extends the time required to restart the NameNode
• To create periodic checkpoints, a dedicated server (the CheckpointNode) is required, since it has the same memory requirements as the NameNode

Page 20

CONCLUSION

Page 21

Conclusion
• HDFS has a good reliability model, which can handle the expected hardware failures
• While several techniques are used to achieve namespace fault tolerance, the NameNode is still a single point of failure in the system
• Many reliability parameters are configurable and can be tuned to fit a system's demands:
  • Replica count
  • Rack scattering policy
  • Checkpoint and journal redundancy