
Configuring Replication Factor and Block Size in HDFS


In this post, we will discuss how to configure the replication factor and block size for an entire cluster, as well as for individual directories and files in HDFS.

Hadoop Distributed File System (HDFS) stores files as blocks and distributes them across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated several times to ensure high data availability.

Before going ahead, it is important to know some basics: what blocks and block size are, and what the replication factor is. So, let's get a clear picture of them first.

Blocks and Block Size:

HDFS is designed to store and process huge data sets. A typical block size used by HDFS is 64 MB (the default in Hadoop 1.x; Hadoop 2.x raised the default to 128 MB). We can also change the block size in a Hadoop cluster. All blocks in a file, except the last one, are the same size. When you store a file in HDFS, the system breaks it down into a set of individual blocks and stores these blocks on various slave nodes in the Hadoop cluster. For example, with a 64 MB block size, a 200 MB file is stored as three full 64 MB blocks plus one 8 MB block.

Block Size Configuration for the Entire Cluster:

If you want to set a specific block size for the entire cluster, you need to add a property to hdfs-site.xml, as shown below.

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>

Here, we have set dfs.block.size to 134217728 bytes, i.e., 128 MB. This will be applicable for the entire cluster. Changing the dfs.block.size property in hdfs-site.xml changes the default block size for all files subsequently placed into HDFS; it will not affect the block size of any files already stored. It applies only to files written after the setting takes effect. (In Hadoop 2.x and later, the property is named dfs.blocksize; dfs.block.size remains as its deprecated alias.)
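The block size can also be overridden per file at write time, using the Hadoop shell's generic -D option. Below is a minimal sketch; the paths are illustrative (they match the ones used later in this post), and on Hadoop 2.x the property to pass would be dfs.blocksize:

# Upload one file with a 64 MB block size, overriding the cluster default.
hadoop fs -D dfs.block.size=67108864 -put /home/acadgild/acadgild /test1/

# Inspect how the file was actually split into blocks.
hadoop fsck /test1/acadgild -files -blocks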


Replication Factor:

The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at the time of creation of the file and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The replication factor is a property that can be set in the HDFS configuration file, which also allows you to adjust the global replication factor for the entire cluster. For each block stored in HDFS with a replication factor of n, there will be n - 1 duplicated blocks distributed across the cluster; with the common default of 3, for instance, a 1 GB file occupies 3 GB of raw storage.

Example:

If you want to set 4 as the replication factor for the entire cluster, then you need to specify the replication factor in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <!-- Here you need to set the replication factor for the entire cluster. -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/acadgild/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/acadgild/hadoop/datanode</value>
  </property>
</configuration>
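As with the block size, this default applies only to files written after the change; existing files keep their current factor until it is explicitly changed, as we will see below. The default can also be overridden for a single upload, again via the generic -D option. A minimal sketch (the name localfile is just a placeholder):

hadoop fs -D dfs.replication=2 -put localfile /test1/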


We can also change the replication factor of an individual file. Let's now create a new directory in the HDFS root, as shown below.

hadoop fs -mkdir /test1/

You can verify this using the following command:

hadoop fs -ls /
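In the listing, the second column shows each file's replication factor; directories show a dash there, as they are not themselves replicated.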


Now, let’s add a file into this directory.

hadoop fs -put /home/acadgild/acadgild /test1/

Next, let's try running the command to change the replication factor of a file in the Hadoop cluster. The command for this is shown below:

hadoop fs -setrep -w 5 /test1/acadgild
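The -w flag makes the command wait until the target replication is actually reached, which can take a while. To confirm the new factor afterwards, you can query it directly (a minimal check, valid on recent Hadoop releases):

hadoop fs -stat %r /test1/acadgild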


We can also change the replication factor of all the files within a directory by using the below command:

hadoop fs -setrep -R -w 3 /test/

We now have three files under this test directory, so the command works through them one by one: it starts with the first file and replicates the remaining ones after it.
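To see the updated replication column for every file under the directory, a recursive listing helps (assuming a Hadoop 2.x shell; older releases used -lsr instead of -ls -R):

hadoop fs -ls -R /test/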

Note: Replication of individual files and directories takes time, and how long it takes varies with factors such as:

•The replication factor

•The size of the files and directories

•DataNode hardware

So, it's better not to change the replication factor on a per-file or per-directory basis unless you really need to.

Hope this post has been helpful in understanding the steps to configure block size and replication factor in HDFS. In case of queries, feel free to comment below and we will get back to you at the earliest.


Keep visiting our website www.acadgild.com for more updates on Big Data and other technologies.
