Managing Hadoop Cluster
Topology of a typical Hadoop cluster.
Installation Steps
Install Java
Install ssh and sshd
$ gunzip hadoop-0.18.0.tar.gz
$ tar xvf hadoop-0.18.0.tar
Set JAVA_HOME in conf/hadoop-env.sh
Modify hadoop-site.xml
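The steps above can be sketched as a single shell function; the tarball name and the JAVA_HOME path are assumptions to adjust for your system.

```shell
# A minimal sketch of the unpack-and-configure steps as one function.
# The tarball name and the JAVA_HOME value are assumptions, not prescriptions.
install_hadoop() {
  tarball="$1"                      # e.g. hadoop-0.18.0.tar.gz
  tar xzf "$tarball"                # equivalent to gunzip followed by tar xvf
  dir="${tarball%.tar.gz}"          # extracted directory, e.g. hadoop-0.18.0
  # Point Hadoop at the JVM (hypothetical path):
  echo 'export JAVA_HOME=/usr/lib/jvm/java' >> "$dir/conf/hadoop-env.sh"
}
```

Usage: `install_hadoop hadoop-0.18.0.tar.gz`, then edit conf/hadoop-site.xml as shown later.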
Hadoop Installation Flavors
Standalone
Pseudo-distributed
Hadoop clusters of multiple nodes
Additional Configuration
conf/masters
contains the hostname of the SecondaryNameNode
It should be a fully-qualified domain name.
conf/slaves
contains the hostname of every machine in the cluster that
should run the TaskTracker and DataNode daemons
Ex: slave01
slave02
slave03
Advanced Configuration
enable passwordless ssh
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The ~/.ssh/id_dsa.pub and authorized_keys
files should be replicated on all machines in
the cluster.
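Replicating the key files can itself be scripted. The dry-run sketch below walks conf/slaves and only echoes what it would do; the echo is a stand-in for a real copy such as `ssh-copy-id -i ~/.ssh/id_dsa.pub "$host"` (or scp plus appending to the remote authorized_keys).

```shell
# Dry-run sketch: report the key-copy that would be done for each host
# listed in a slaves file. Replace the echo with ssh-copy-id on a live cluster.
distribute_key() {
  slaves_file="$1"
  while read -r host; do
    [ -n "$host" ] || continue      # skip blank lines
    echo "would copy key to $host"
  done < "$slaves_file"
}
```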
Advanced Configuration
Various directories should be created on each
node
The NameNode requires the NameNode metadata directory
$ mkdir -p /home/hadoop/dfs/name
Every node needs the Hadoop tmp directory and
DataNode directory created
Advanced Configuration..
bin/slaves.sh allows a command to be
executed on all nodes listed in the slaves file.
$ mkdir -p /tmp/hadoop
$ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf
$ export HADOOP_SLAVES=${HADOOP_CONF_DIR}/slaves
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /tmp/hadoop"
$ ${HADOOP_HOME}/bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"
Format HDFS
$ bin/hadoop namenode -format
start the cluster:
$ bin/start-all.sh
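After start-all.sh, `jps` on each node is a quick sanity check. The helper below (a hypothetical name, not a Hadoop tool) takes `jps` output and reports any expected daemon that is missing; which daemons to expect depends on the node's role.

```shell
# Sketch: given the output of `jps` on a node, print each expected daemon
# that does not appear. Daemon names are the standard Hadoop ones.
missing_daemons() {
  jps_out="$1"; shift
  for daemon in "$@"; do
    echo "$jps_out" | grep -q "$daemon" || echo "$daemon"
  done
}
# On the master:  missing_daemons "$(jps)" NameNode JobTracker SecondaryNameNode
# On a slave:    missing_daemons "$(jps)" DataNode TaskTracker
```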
Selecting Machines
Hadoop is designed to take advantage of
whatever hardware is available
Hadoop jobs written in Java can consume
between 1 and 2 GB of RAM per core
If you use HadoopStreaming to write your jobs
in a scripting language such as Python, more
memory may be advisable.
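That rule of thumb turns into a quick capacity estimate. The helper below is illustrative (the function name and budget figure are not from the original): it computes how many concurrent task slots a node's RAM can sustain at a given GB-per-task budget, using integer division.

```shell
# Back-of-envelope sizing for the 1-2 GB-per-core guideline above.
# ram_gb: total RAM on the node; gb_per_task: budget per task (1 or 2 here).
slots_for_ram() {
  ram_gb="$1"; gb_per_task="$2"
  echo $(( ram_gb / gb_per_task ))
}
```

For example, a node with 16 GB of RAM at 2 GB per task supports `slots_for_ram 16 2` = 8 slots.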
Cluster Configurations
Small Clusters: 2-10 Nodes
Medium Clusters: 10-40 Nodes
Large Clusters: Multiple Racks
Small Clusters: 2-10 Nodes
In a two-node cluster,
one node runs the NameNode/JobTracker and a
DataNode/TaskTracker;
the other node runs a DataNode/TaskTracker.
Clusters of three or more machines typically
use a dedicated NameNode/JobTracker, and
all other nodes are workers.
configuration in conf/hadoop-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head.server.node.com:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head.server.node.com:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Medium Clusters: 10-40 Nodes
The single point of failure in a Hadoop cluster
is the NameNode
Hence, back up the NameNode metadata.
One machine in the cluster should be designated
as the NameNode's backup.
It does not run the normal Hadoop daemons;
instead, it exposes a directory via NFS which is
mounted only on the NameNode.
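One way to wire this up is sketched below; the host names and export path are hypothetical, not from the original.

```
# /etc/exports on the backup machine (names and paths are illustrative):
/export/namenode-backup  namenode-host(rw,sync)

# On the NameNode, mount the export at the path dfs.name.dir will reference:
# mount backup-host:/export/namenode-backup /mnt/namenode-backup
```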
NameNode's backup
The cluster's hadoop-site.xml file should then
instruct the NameNode to write to this
directory as well:
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/namenode-backup</value>
  <final>true</final>
</property>
conf/hadoop-site.xml
Nodes must be decommissioned on a schedule that permits replication of blocks being decommissioned.
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>
<property>
  <name>mapred.hosts.exclude</name>
  <value>/home/hadoop/excludes</value>
  <final>true</final>
</property>
Create an empty file with this name:
$ touch /home/hadoop/excludes
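Retiring a node then comes down to listing it in the excludes file and asking the NameNode to re-read its host lists. The function name below is hypothetical; the `dfsadmin -refreshNodes` command is the real refresh step, left commented so the sketch runs off-cluster.

```shell
# Sketch: add a node to the excludes file, then refresh the NameNode's
# host lists so decommissioning (block re-replication) begins.
decommission_node() {
  node="$1"; excludes="$2"
  echo "$node" >> "$excludes"
  # bin/hadoop dfsadmin -refreshNodes   # run on the NameNode to apply
}
```

After the refresh, wait until the node's blocks have been re-replicated before shutting the machine down.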
Replication Setting
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
Tutorial
Configure a Hadoop cluster on two nodes.
Tutorial-Installed Hadoop in Cluster.docx
Performance Monitoring
Ganglia
Nagios
Ganglia
performance monitoring framework for
distributed systems
collects metrics on individual machines and
forwards them to an aggregator
designed to be integrated into other
applications
Ganglia
Install and configure Ganglia
Create a file named hadoop-metrics.properties in the $HADOOP_HOME/conf directory:
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649
Nagios
a machine and service monitoring system
designed for large clusters
provides useful diagnostic information for
tuning your cluster, including network, disk,
and CPU utilization across machines.
Tutorial
Install Ganglia/Nagios and monitor Hadoop
Tutorial-MonitorHadoopWithGanglia.docx