HADOOP DISTRIBUTED FILE SYSTEM HDFS Reliability Based on “The Hadoop Distributed File System” K. Shvachko et al., MSST 2010 Michael Tsitrin 26/05/13


Page 1

HADOOP DISTRIBUTED FILE SYSTEM
HDFS Reliability

Based on "The Hadoop Distributed File System", K. Shvachko et al., MSST 2010

Michael Tsitrin
26/05/13

Page 2

Topics
• Introduction
• HDFS Overview
  • Basics
  • Architecture
• Data reliability
  • Block replicas
• NameNode reliability
  • NameNode failure
  • Journal
  • Checkpoint
• Conclusion

Page 3

INTRODUCTION

Page 4

Introduction
• HDFS is a cloud-based file system that allows storage of large data sets on clusters of commodity hardware
• With a huge number of components, each having a non-trivial probability of failure, hardware failure is the norm rather than the exception
• The purpose of this presentation is to present the techniques HDFS uses to keep the system and its data reliable

Page 5

HDFS OVERVIEW

Page 6

HDFS Basics
• An open-source implementation of a distributed file system, based on the Google File System
• Designed to store very large data sets reliably across large clusters of computers
• Optimized for MapReduce applications:
  • Large files, some several GB in size
  • Reads are performed in a large streaming fashion
  • High throughput rather than low latency

Page 7

HDFS Architecture

[Architecture diagram: a single NameNode holds the metadata (file name, replica count, block list, e.g. /home/foo/data) and serves clients' metadata ops; clients write and read blocks directly to and from DataNodes spread across racks (Rack 1, Rack 2); the NameNode sends block ops, such as replication, to the DataNodes.]

Page 8

HDFS NameNode
• The HDFS NameNode keeps the metadata for each data block in the system
• It is implemented as a single master server for the cluster
• To achieve high performance, the entire namespace is kept in RAM
• It manages the replication logic for the DataNodes
• It serves clients with file block locations for reads
• The metadata includes:
  • The files and directories hierarchy
  • Permissions, modification times, etc.
  • The mapping of file blocks to DataNodes
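As a rough illustration of how this metadata might fit together, here is a minimal Python sketch; the structure names and the `locate` helper are hypothetical, not Hadoop's actual classes. The namespace maps paths to block IDs, while block-to-DataNode locations are kept in a separate map, since those are rebuilt from block reports rather than persisted.

```python
# Hypothetical in-RAM metadata layout (illustrative names, not real
# Hadoop internals).  The namespace maps each file path to its ordered
# block IDs; a separate map records which DataNodes hold each block.
namespace = {
    "/home/foo/data": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn4", "dn5"},
}

def locate(path):
    """Serve a client read: list the DataNodes for each block of a file."""
    return [(blk, sorted(block_locations[blk])) for blk in namespace[path]]
```

Keeping everything in plain in-memory maps is what makes lookups fast, and it is also why the NameNode needs the persistence machinery described later in the deck.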

Page 9

HDFS DataNode
• A cluster can contain thousands of DataNodes
• The DataNode is where the actual file blocks are kept
• User data is divided into blocks and replicated across DataNodes
• A DataNode identifies the block replicas in its possession to the NameNode by sending a block report
• DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode

Page 10

DATA RELIABILITY
Block replicas

Page 11

NameNode & Data Replication
• All data-replication information is stored and managed by the NameNode
• The NameNode makes all decisions regarding the replication of blocks
• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster
  • A Blockreport contains a list of all blocks on a DataNode
  • Receipt of a Heartbeat implies that the DataNode is functioning properly
  • DataNodes without a recent Heartbeat are marked as dead
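The liveness rule in the last bullet can be sketched in a few lines of Python; the timeout value and function name here are assumptions for illustration (the real HDFS timeout is configurable), not Hadoop's implementation.

```python
HEARTBEAT_TIMEOUT = 600.0  # seconds; an assumed value, not HDFS's default

def dead_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat maps each DataNode name to the timestamp of its most
    recent heartbeat; `now` is the current time in the same units.
    """
    return {dn for dn, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT}
```

Any node this check flags becomes an input to the re-replication logic on the next slide.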

Page 12

Re-replication
• The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary
• The need for re-replication may arise for many reasons:
  • A DataNode may become unavailable
  • The replication factor of a file may be increased
  • A replica may become corrupted
  • A hard disk on a DataNode may fail
• Re-replication is fast because it is a parallel problem that scales with the size of the cluster
  • This lowers the probability of block loss while replication is carried out
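The detection step behind all of these cases can be sketched as a scan for blocks whose live replica count has dropped below the target. This is a simplification under assumed names; real HDFS maintains priority queues of under-replicated blocks rather than rescanning.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return the blocks with fewer than `target` replicas on live nodes.

    block_locations maps block ID -> set of DataNodes holding a replica;
    live_nodes is the set of DataNodes currently considered alive.
    """
    return {blk for blk, nodes in block_locations.items()
            if len(nodes & live_nodes) < target}
```

Every block this returns becomes a candidate for copying to a fresh DataNode, which is why losing a node triggers work proportional only to that node's blocks.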

Page 13

Replica placement
• To protect against rack failure (e.g. a power shortage), the NameNode can manage replicas so that they are stored in different racks
• Besides data reliability, this can also improve network bandwidth and clients' latency
• Common case (replication factor == 3):
  • Put one replica on one node in the local rack
  • Another on a different node in the local rack
  • The last on a different node in a different rack
• This doesn't compromise data reliability and availability
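The common-case policy above can be sketched as follows. This is a simplification of the policy as stated on this slide, not Hadoop's actual BlockPlacementPolicyDefault, and the rack/node names are illustrative.

```python
import random

def place_replicas(local_rack, racks, rng=random):
    """Pick three targets per the slide's policy: two distinct nodes in
    the writer's local rack, plus one node in a different rack.

    racks maps rack name -> list of node names in that rack.
    """
    first, second = rng.sample(racks[local_rack], 2)   # two local-rack nodes
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    third = rng.choice(racks[remote_rack])             # one off-rack node
    return [first, second, third]
```

Because only one replica leaves the local rack, write traffic crosses the inter-rack link once, yet a whole-rack failure still leaves a surviving copy.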

Page 14

NAMENODE RELIABILITY

Page 15

NameNode Failure
• The NameNode is a single point of failure for the HDFS cluster
  • If it becomes unavailable to clients, the whole cluster is unusable
  • Corruption or loss of the metadata makes the data blocks unavailable
• The NameNode keeps its data in RAM, risking full metadata loss in case of a power shortage
  • A persistent solution is needed

Page 16

NameNode Persistence
• The persistent record of the image stored in the local host's native file system is called a checkpoint
• The NameNode also stores the modification log of the image, called the journal, in the local host's native file system
• For improved durability, redundant copies of the checkpoint and journal can be made at other servers

Page 17

Journal
• The journal persistently records every change that occurs to the file system metadata (not including the block mapping)
• It is implemented as a write-ahead commit log for changes to the file system that must be persistent
  • To avoid becoming a bottleneck, several transactions are batched and committed together
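A toy sketch of such batched commits follows; the class, method names, and batch size are illustrative, not Hadoop's actual journal implementation.

```python
class Journal:
    """Write-ahead log that defers the expensive flush until a batch of
    transactions has accumulated, so one sync covers many changes."""

    def __init__(self, batch_size=5):
        self.batch_size = batch_size
        self.pending = []   # logged but not yet durable
        self.durable = []   # stands in for bytes synced to stable storage

    def log(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # One (simulated) fsync commits the whole batch at once.
        self.durable.extend(self.pending)
        self.pending.clear()
```

The trade-off is the usual one for group commit: each transaction may wait slightly longer to become durable, but the per-transaction sync cost is amortized across the batch.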

Page 18

Checkpoint
• A checkpoint is a persistent record of the NameNode's state written to disk
• The checkpoint file is never changed by the NameNode
  • Either a new checkpoint is created, or a namespace is loaded from a previous checkpoint by the NameNode
• When the NameNode starts, it performs the checkpoint process:
  • Reads the current checkpoint and journal from disk
  • Applies all the transactions from the journal to the in-memory representation of the namespace
  • Flushes out this new version into a new checkpoint on disk
  • Truncates the old journal
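Those four startup steps can be sketched end to end. Here transactions are simplified to (op, path) pairs and the namespace to a plain dict; the real on-disk formats differ.

```python
def startup_checkpoint(checkpoint, journal):
    """Replay the journal over the checkpoint image, then return a fresh
    checkpoint together with an empty (truncated) journal."""
    namespace = dict(checkpoint)       # 1. read the current checkpoint
    for op, path in journal:           # 2. apply every journal transaction
        if op == "create":
            namespace[path] = []       #    new file with no blocks yet
        elif op == "delete":
            namespace.pop(path, None)
    new_checkpoint = dict(namespace)   # 3. flush the new version to disk
    return new_checkpoint, []          # 4. the old journal is truncated
```

Because the journal is only replayed, never rewritten, a crash during this process leaves the old checkpoint and journal intact to retry from.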

Page 19

Creating a Checkpoint
• A new checkpoint file can be created at startup only, or periodically
• Creating a checkpoint empties the journal, which matters because:
  • A long journal increases the probability of loss or corruption of the journal file
  • A very large journal extends the time required to restart the NameNode
• To create periodic checkpoints, a dedicated server (the CheckpointNode) is required, since it has the same memory requirements as the NameNode

Page 20

CONCLUSION

Page 21

Conclusion
• HDFS has a good reliability model, which can handle the expected hardware failures
• While several techniques are used to achieve namespace fault tolerance, the NameNode is still a single point of failure in the system
• Many reliability parameters are configurable and can be tuned to fit a system's demands:
  • Replica count
  • Rack scattering policy
  • Checkpoint and journal redundancy