Getting Started With Hadoop



    Getting Started With Hadoop By Reeshu Patel


    Hadoop

Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. An effective programmer today must have knowledge of relational databases, networking, and security, all of which were considered optional skills a couple of decades ago. Similarly, a basic understanding of distributed data processing will soon become an essential part of every programmer's toolbox.

Leading universities, such as Stanford and CMU, have already started introducing Hadoop into their computer science curriculum. This book will help you, the practicing programmer, get up to speed on Hadoop quickly and start using it to process your data sets.

We introduce Hadoop more formally, positioning it in terms of distributed systems and data processing systems, and give an overview of the MapReduce programming model. A simple word counting example with existing tools highlights the challenges around processing data at large scale.

You will implement that example using Hadoop to gain a deeper appreciation of Hadoop's simplicity. We will also discuss the history of Hadoop and some perspectives on the MapReduce paradigm. But let me first briefly explain why I wrote this book and why it is useful to you.


Why Hadoop

Speaking from experience, I first found Hadoop to be tantalizing in its possibilities, yet frustrating to progress beyond coding the basic examples. The documentation at the official Hadoop site is fairly comprehensive, but it is not always easy to find straightforward answers to straightforward questions. The purpose of writing this book is to address this problem. I won't focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice.

    Comparing SQL databases and Hadoop

Given that Hadoop is a framework for processing data, what makes it better than standard relational databases, the workhorse of data processing in most of today's applications? One reason is that SQL (structured query language) is by design targeted at structured data. Many of Hadoop's initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL. For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language which can be implemented on top of Hadoop as the execution engine. But in practice, SQL databases tend to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch to the requirements that Hadoop targets.

With that in mind, let's make a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.


    Scaling

Before going through a formal treatment of MapReduce, let's go through an exercise of scaling a simple program to process a large data set. You'll see the challenges of scaling a data processing program and will better appreciate the benefits of using a framework such as MapReduce to handle the tedious chores for you.

Our exercise is to count the number of times each word occurs in a set of documents. In this example, we have a document set containing only one document with only one sentence:

Do as I say, not as I do. We derive the word counts shown below. We'll call this particular exercise word counting. When the set of documents is small, a straightforward program will do the job; a simple sketch of such a program follows the word counts below.

Word    Count
as      2
do      2
i       2
not     1
say     1
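For a data set this small, such a straightforward program can be an ordinary single-machine one. The sketch below is our own illustration (not from the original text): it counts words with an in-memory map, which works fine here but is exactly the approach that stops working once the documents no longer fit on one machine.

import java.util.HashMap;
import java.util.Map;

public class NaiveWordCount {
    public static void main(String[] args) {
        String document = "Do as I say, not as I do.";
        Map<String, Integer> counts = new HashMap<>();
        // Normalize case, strip punctuation, and split on whitespace.
        for (String word : document.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);  // increment the running count for this word
            }
        }
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}

Running it on the single-sentence document above prints exactly the counts in the table.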

    What is Hadoop

    Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is

    1:-Accessible

    2:-Robust

    3:-Scalable

    4:-Simple


Hadoop's accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create their own Hadoop cluster. On the other hand, its robustness and scalability make it suitable for even the most demanding jobs at Yahoo and Facebook. These features make Hadoop popular in both academia and industry.

Understanding distributed systems and Hadoop

To understand the popularity of distributed systems (scale-out) vis-à-vis huge monolithic servers (scale-up), consider the price performance of current I/O technology. A high-end machine with four I/O channels, each having a throughput of 100 MB/sec, will require about three hours to read a 4 TB data set (4 TB divided by an aggregate 400 MB/sec is roughly 10,000 seconds). With Hadoop, this same data set will be divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. And such a cluster of commodity machines turns out to be cheaper than one high-end server.

    Understanding MapReduce

    You are probably aware of data processing models such as pipelines and message queues. These models provide specific capabilities in developing different aspects of data processing applications. The most familiar pipelines are the Unix pipes. Pipelines can help the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues can help the synchronization of processing primitives.


    The programmer writes her data processing task as processing primitives in the form of either a producer or a consumer. The timing of their execution is managed by the system.

Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once you write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
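For orientation, the sketch below shows roughly the shape that user code takes under this model, written against the type parameters of Hadoop's org.apache.hadoop.mapreduce Java API. The class names are our own placeholders; a concrete WordCount version appears later in this guide.

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A mapper turns each input (key, value) record into zero or more
// intermediate (key, value) pairs.
class MyMapper<KIN, VIN, KOUT, VOUT> extends Mapper<KIN, VIN, KOUT, VOUT> {
    @Override
    protected void map(KIN key, VIN value, Context context)
            throws java.io.IOException, InterruptedException {
        // examine the record and emit intermediate pairs:
        // context.write(someOutputKey, someOutputValue);
    }
}

// A reducer receives each intermediate key together with all of the values
// emitted for it, and writes the final output pairs.
class MyReducer<KIN, VIN, KOUT, VOUT> extends Reducer<KIN, VIN, KOUT, VOUT> {
    @Override
    protected void reduce(KIN key, Iterable<VIN> values, Context context)
            throws java.io.IOException, InterruptedException {
        // combine the values for this key and emit the result:
        // context.write(finalKey, finalValue);
    }
}

Everything else, such as splitting the input, moving intermediate pairs to the right reducer, and rerunning failed tasks, is handled by the framework.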

    Starting Hadoop

    If you work in an environment where someone else sets up the Hadoop cluster for you, you may want to skim through this chapter. You want to understand enough to set up your personal development machine, but you can skip through the details of configuring the communication and coordination of various nodes.

After discussing the physical components of Hadoop, we progress to setting up your cluster. We will then focus on the three operational modes of Hadoop and how to set them up, and finally you'll read about web-based tools that assist in monitoring your cluster.


    The building blocks of Hadoop

We discussed the concepts of distributed storage and distributed computation in the previous chapter. Now let's see how Hadoop implements those ideas. On a fully configured cluster, running Hadoop means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist only on one server, some exist across multiple servers. The daemons include:

    1:-NameNode

    2:-DataNode

    3:-Secondary NameNode

    4:-JobTracker

    5:-TaskTracker

    We discuss each one and its role within Hadoop.

    1:-NameNode

Let's begin with arguably the most vital of the Hadoop daemons: the NameNode. Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop File System, or HDFS. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed file system.

The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine. This means that the NameNode server doesn't double as a DataNode or a TaskTracker.

There is unfortunately a negative aspect to the importance of the NameNode: it is a single point of failure of your Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly or you can quickly restart it. Not so for the NameNode.

    2:-DataNode

Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.


Figure 1.1 illustrates the roles of the NameNode and DataNodes. In this figure, we show two data files, one at /user/chuck/data1 and another at /user/james/data2. The data1 file takes up three blocks, which we denote 1, 2, and 3, and the data2 file consists of blocks 4 and 5. The content of the files is distributed among the DataNodes. In this illustration, each block has three replicas. For example, block 1 (used for data1) is replicated over the three rightmost DataNodes. This ensures that if any one DataNode crashes or becomes inaccessible over the network, you will still be able to read the files.
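From the client's point of view this interaction is hidden behind the HDFS FileSystem API; the NameNode lookup and the direct DataNode transfers happen inside the open() and read() calls. The following is a minimal, illustrative Java sketch (the class name is ours, and the path is just the example from the figure) of reading an HDFS file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml and friends
        FileSystem fs = FileSystem.get(conf);          // client handle to the default file system (HDFS)
        Path file = new Path("/user/chuck/data1");     // example path from Figure 1.1
        try (FSDataInputStream in = fs.open(file);     // NameNode resolves the block locations
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {   // bytes stream directly from the DataNodes
                System.out.println(line);
            }
        }
    }
}

Writing is symmetric: fs.create(path) returns an output stream, and the DataNodes take care of replicating the written blocks as described above.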

    3:-Secondary NameNode

The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster's HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode. We will discuss the recovery process in chapter 8 when we cover best practices for managing your cluster.

    4:-JobTracker

The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they are running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It typically runs on a server as a master node of the cluster.

    5:-TaskTracker

    As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job and the TaskTrackers manage the execution of individual tasks on each slave node. Figure 2.2 illustrates this interaction.

    Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

Figure 2.2 JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.

    Define a common account

We have been speaking in general terms of one node accessing another; more precisely this access is from a user account on one node to another user account on the target machine. For Hadoop, the accounts should have the same username on all of the nodes (we use hadoop-user in this book), and for security purposes we recommend that it be a user-level account. This account is only for managing your Hadoop cluster. Once the cluster daemons are up and running, you'll be able to run your actual MapReduce jobs from other accounts.

    Verify SSH installation

The first step is to check whether SSH is installed on your nodes. We can easily do this by using the "which" UNIX command:

    [hadoop-user@master]$ which ssh

    /usr/bin/ssh

    [hadoop-user@master]$ which sshd

    /usr/bin/sshd

    [hadoop-user@master]$ which ssh-keygen

    /usr/bin/ssh-keygen

    If you instead receive an error message such as this,

    /usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...

    Install OpenSSH (www.openssh.com) via a Linux package manager or by downloading the source directly. (Better yet, have your system administrator do it for you.)

    Generate SSH key pair


Having verified that SSH is correctly installed on all nodes of the cluster, we use ssh-keygen on the master node to generate an RSA key pair. Be certain to avoid entering a passphrase, or you'll have to manually enter that phrase every time the master node attempts to access another node.

    [hadoop-user@master]$ ssh-keygen -t rsa

    Generating public/private rsa key pair.

    Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):

    Enter passphrase (empty for no passphrase):

    Enter same passphrase again:

    Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.

    Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.

    After creating your key pair, your public key will be of the form

    [hadoop-user@master]$ more /home/hadoop-user/.ssh/id_rsa.pub

    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA1WS3RG8LrZH4zL2/1oYgkV1OmVclQ2OO5vRi0Nd K51Sy3wWpBVHx82F3x3ddoZQjBK3uvLMaDhXvncJG31JPfU7CTAfmtgINYv0kdUbDJq4TKG/fuO5q

    J9CqHV71thN2M310gcJ0Y9YCN6grmsiWb2iMcXpy2pqg8UM3ZKApyIPx99O1vREWm+4moFTg


    YwIl5be23ZCyxNjgZFWk5MRlT1p1TxB68jqNbPQtU7fIafS7Sasy7h4eyIy7cbLh8x0/V4/mcQsY

    5dvReitNvFVte6onl8YdmnMpAh6nwCvog3UeWWJjVZTEBFkTZuV1i9HeYHxpm1wAzcnf7az78jT

    IRQ== hadoop-user@master

    and we next need to distribute this public key across your cluster.

    Distribute public key and validate logins

    Albeit a bit tedious, you will next need to copy the public key to every slave node as well as the master node:

[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key

    Manually log in to the target node and set the master key as an authorized key (or append to the list of authorized keys if you have others defined).

After generating the key, you can verify it's correctly defined by attempting to log in to the target node from the master:

[hduser@master]$ ssh target

The authenticity of host 'target (xxx.xxx.xxx.xxx)' can't be established.

    RSA key fingerprint is 72:31:d8:1b:11:36:43:52:56:11:77:a4:ec:82:03:1d.

    Are you sure you want to continue connecting (yes/no)? yes


    Warning: Permanently added 'target' (RSA) to the list of known hosts.

    Last login: Sun Jan 4 15:32:22 2009 from master

After confirming the authenticity of a target node to the master node, you won't be prompted upon subsequent login attempts.

    [hduser@master]$ ssh target

    Last login: Sun Jan 4 15:32:49 2009 from master

We have now set the groundwork for running Hadoop on your own cluster. Let's discuss the different Hadoop modes you might want to use for your projects.

    Running Hadoop

We need to configure a few things before running Hadoop.

The first thing you need to do is to specify the location of Java on all the nodes, including the master. In hadoop-env.sh define the JAVA_HOME environment variable to point to the Java installation directory. On our servers, we have it defined as

    export JAVA_HOME=/usr/share/jdk

(If you followed the examples in chapter 1, you've already completed this step.) The hadoop-env.sh file contains other variables for defining your Hadoop environment, but JAVA_HOME is the only one requiring initial modification. The default settings on the other variables will probably work fine. As you become more familiar with Hadoop you can later modify this file to suit your individual needs (logging directory location, Java class path, and so on).


The majority of Hadoop settings are contained in XML configuration files. Before version 0.20, these XML files were hadoop-default.xml and hadoop-site.xml. As the names imply, hadoop-default.xml contains the default Hadoop settings to be used unless they are explicitly overridden in hadoop-site.xml. In practice you only deal with hadoop-site.xml. In version 0.20 this file has been separated out into three XML files: core-site.xml, hdfs-site.xml, and mapred-site.xml. This refactoring better aligns the configuration settings to the subsystem of Hadoop that they control. In the rest of this chapter we will generally point out which of the three files is used to adjust a configuration setting. If you use an earlier version of Hadoop, keep in mind that all such configuration settings are modified in hadoop-site.xml.

    Local (standalone) mode

The standalone mode is the default mode for Hadoop. When you first uncompress the Hadoop source package, it is ignorant of your hardware setup. Hadoop chooses to be conservative and assumes a minimal configuration. All three XML files (or hadoop-site.xml before version 0.20) are empty under this default mode.

With empty configuration files, Hadoop will run completely on the local machine. Because there's no need to communicate with other nodes, the standalone mode doesn't use HDFS, nor will it launch any of the Hadoop daemons. Its primary use is for developing and debugging the application logic of a MapReduce program without the additional complexity of interacting with the daemons. When you ran the example MapReduce program in chapter 1, you were running it in standalone mode.

    Pseudo-distributed mode

The pseudo-distributed mode is running Hadoop in a cluster of one, with all daemons running on a single machine. This mode complements the standalone mode for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions. Listing 2.1 provides simple XML files to configure a single server in this mode.

Listing 2.1 Example of the three configuration files for pseudo-distributed mode

core-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation.
    </description>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs at.
    </description>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of replications can be specified when
    the file is created.
    </description>
  </property>
</configuration>

    Setting up Hadoop on your machine: This recipe describes how to run Hadoop in the local mode.


First, we will get ready by installing Java on our PC:

Download and install Java 1.7 or a higher version from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

You can use the following commands for Java installation:

$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:ferramroberto/java
# Update the source list
$ sudo apt-get update
# Install Sun Java 6 JDK
$ sudo apt-get install sun-java6-jdk
# Select Sun's Java as the default on your machine.
# See 'sudo update-alternatives --config java' for more information
$ sudo update-java-alternatives -s java-6-sun

After installation, make a quick check whether the Sun JDK is correctly set up:

1:- user@ubuntu:~# java -version
2:- java version "1.6.0_20"
3:- Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
4:- Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

    Now let us do the Hadoop installation:

1:-Download the most recent Hadoop 1.0 branch distribution from http://hadoop.apache.org/.
2:-Unzip the Hadoop distribution using the following command. You will have to change the x.x in the filename with the actual release you have downloaded. If you are using Windows, you should use your favorite archive program such as WinZip or WinRAR for extracting the distribution. From this point onward, we shall call the unpacked Hadoop directory HADOOP_HOME.

>tar -xzf hadoop-1.x.x.tar.gz

For adding a dedicated Hadoop user and group, use the following commands:

1:- $ sudo addgroup hadoop

    2:- $ sudo adduser --ingroup hadoop hduser

    This will add the user hduser and the group hadoop to your local machine.

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.


1:- user@ubuntu:~$ su - hduser
2:- hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
3:- Generating public/private rsa key pair.
4:- Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
5:- Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is: 9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is: [...snipp...]
7:- hduser@ubuntu:~$

The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

1:- hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine with the hduser user.

    1:- hduser@ubuntu:~$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]

2:- hduser@ubuntu:~$

Disabling IPv6

One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box. In my case, I realized that there's no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary. To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You have to reboot your machine in order to make the changes take effect. You can check whether IPv6 is enabled on your machine with the following command:

    1:-$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled (that's what we want).

You can also disable IPv6 only for Hadoop, as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:


    conf/hadoop-env.sh

    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Hadoop Installation

Download Hadoop from the Apache download mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

$ cd /usr/local
$ sudo tar xzf hadoop-1.0.3.tar.gz
$ sudo mv hadoop-1.0.3 hadoop
$ sudo chown -R hduser:hadoop hadoop

(Just to give you the idea: YMMV, but personally I create a symlink from hadoop-1.0.3 to hadoop.)

Now we update $HOME/.bashrc. Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun

unalias fs &> /dev/null
alias fs="hadoop fs"

unalias hls &> /dev/null
alias hls="fs -ls"

export PATH=$PATH:$HADOOP_HOME/bin

How it works

Hadoop local mode does not start any servers but does all the work within the same JVM. When you submit a job to Hadoop in the local mode, that job starts a JVM to run the job, and that JVM carries out the job. The output and the behavior of the job are the same as a distributed Hadoop job, except for the fact that the job can only use the current node for running tasks. In the next recipe, we will discover how to run a MapReduce program using the unzipped Hadoop distribution.

HDFS Concepts

HDFS is the distributed filesystem that is available with Hadoop. MapReduce tasks use HDFS to read and write data. HDFS deployment includes a single NameNode and multiple DataNodes. For the HDFS setup, we need to configure NameNodes and DataNodes, and then specify the DataNodes in the slaves file. When we start the NameNode, the startup script will start the DataNodes.

    The Hadoop Distributed File System Architecture and Design

    The following picture gives an overview of the most important HDFS components.


Configuration

Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop website.

hadoop-env.sh

The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

Change conf/hadoop-env.sh

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To conf/hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Note: If you are on a Mac with OS X 10.7 you can use the following line to set up JAVA_HOME in conf/hadoop-env.sh:

export JAVA_HOME=`/usr/libexec/java_home`

Now we create the directory and set the required ownerships and permissions. You can leave the settings below as is, with the exception of the hadoop.tmp.dir parameter; this parameter you must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop's default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don't be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

Note:- We will create the directory as root.

conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use the Hadoop Distributed File System, HDFS, even though our little cluster only contains our single local machine. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

    conf/core-site.xml


<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In file conf/mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

In file conf/hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if
  replication is not specified in create time.</description>
</property>


Formatting the HDFS filesystem via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

    1:-hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$

Starting your single-node cluster

Run the command:

    hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

This will start up a NameNode, a DataNode, a JobTracker and a TaskTracker on your machine.

    The output will look like this:

    hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$

A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0). See also how to debug MapReduce programs.

    hduser@ubuntu:/usr/local/hadoop$ jps


2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

Stopping your single-node cluster

Run the command

    hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

    to stop all the daemons running on your machine.

    Example output:

    hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
hduser@ubuntu:/usr/local/hadoop$

Web-based cluster UI

Having covered the operational modes of Hadoop, we can now introduce the web interfaces that Hadoop provides to monitor the health of your cluster. The browser interface allows you to access information you desire much faster than digging through logs and directories.

The NameNode hosts a general report on port 50070. It gives you an overview of the state of your cluster's HDFS. Figure 2.4 displays this report for a 2-node cluster example. From this interface, you can browse through the filesystem, check the status of each DataNode in your cluster, and peruse the Hadoop daemon logs to verify your cluster is functioning correctly.

Again, a wealth of information is available through this reporting interface. You can access the status of ongoing MapReduce tasks as well as detailed reports about completed jobs. The latter is of particular importance: these logs describe which nodes performed which tasks and the time/resources required to complete each task. Finally, the Hadoop configuration for each job is also available, as shown in figure 2.6. With all of this information you can streamline your MapReduce programs to better utilize the resources of your cluster.

    http://localhost:50070 :-web UI of the NameNode daemon

    Figure 2.4 A snapshot of the HDFS web interface. From this interface you can browse through the HDFS filesystem, determine the storage available on each individual node, and monitor the overall health of your cluster.


    Figure 2.5 A snapshot of the MapReduce web interface. This tool allows you to monitor active MapReduce jobs and access the logs of each map and reduce task. The logs of previously submitted jobs are also available and are useful for debugging your programs.

Hadoop modules

Apache Hadoop includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

1:-Avro: A data serialization system.
2:-Cassandra: A scalable multi-master database with no single points of failure.
3:-Chukwa: A data collection system for managing large distributed systems.
4:-HBase: A scalable, distributed database that supports structured data storage for large tables.
5:-Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
6:-Mahout: A scalable machine learning and data mining library.
7:-Pig: A high-level data-flow language and execution framework for parallel computation.
8:-ZooKeeper: A high-performance coordination service for distributed applications.

Getting Started with MapReduce Programming on a Single Cluster

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages; we shall look at the same program expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.

    Writing a WordCount MapReduce sample

    How to write a simple MapReduce program and how to execute it.


To run a MapReduce job, users should furnish a map function, a reduce function, input data, and an output data location. When executed, Hadoop carries out the following steps (a Java sketch of the map and reduce functions follows the list):

    1:- Hadoop breaks the input data into multiple data items by new lines and runs the map function once for each data item, giving the item as the input for the function. When executed, the map function outputs one or more key-value pairs.

    2:- Hadoop collects all the key-value pairs generated from the map function, sorts them by the key, and groups together the values with the same key.

    3:- For each distinct key, Hadoop runs the reduce function once while passing the key and list of values for that key as input.

    4:- The reduce function may output one or more key-value pairs, and Hadoop writes them to a file as the final result.
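To make these steps concrete, here is an illustrative sketch of the map and reduce functions for word counting, using the org.apache.hadoop.mapreduce API. The class names are our own, and the comments note which of the steps above each part corresponds to.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // step 1: emit (word, 1) for every token
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {   // steps 2 and 3: values arrive already grouped by key
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // step 4: write the final (word, total) pair
    }
}

A driver that ties these classes to input and output locations is sketched later, after the discussion of input and output formats.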

About the combiner step in the WordCount MapReduce program

    After running the map function, if there are many key-value pairs with the same key, Hadoop has to move all those values to the reduce function. This can incur a significant overhead. To optimize such scenarios, Hadoop supports a special function called combiner. If provided, Hadoop will call the combiner from the same node as the map node before invoking the reducer and after running the mapper. This can significantly reduce the amount of data transferred to the reduce step.


    Developing a MapReduce Application with Hadoop

We have looked at the domain of business problems that Hadoop was designed to solve, and the internal architecture of Hadoop that allows it to solve these problems. Applications that run in Hadoop are called MapReduce applications, so this article demonstrates how to build a simple MapReduce application.

Setting Up a Development Environment

We will use three ebooks from Project Gutenberg for this example:

The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg. In this case we create the directory:

*****************************************************************
$ sudo mkdir -p /app/hadoop/tmp/gutenberg
$ sudo chown hduser:hadoop /app/hadoop/tmp/gutenberg
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp/gutenberg

After that we copy all files into the gutenberg directory with the cp command, and then we check:

*************************************************************************
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
hduser@ubuntu:~$

********************************************************************

Download and decompress this file on your local machine. If you are planning on doing quite a bit of Hadoop development, it might be in your best interest to add the decompressed bin folder to your environment PATH. You can test your installation by executing the hadoop command from the bin folder:

bin/hadoop

There are numerous commands that can be passed to Hadoop, but in this article we will be focusing on executing Hadoop applications in a development environment, so the only one we will be interested in is the following:

    hadoop jar

Phases in the MapReduce Framework

MapReduce works by breaking the processing into two phases:

1:-Map phase
2:-Reduce phase

Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function. The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it. Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reducer function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):

(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)

The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:

(1949, [111, 78])
(1950, [0, 22, -11])

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:

(1949, 111)
(1950, 22)

This is the final output: the maximum global temperature recorded in each year. The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.

Figure 2-1. MapReduce logical data flow
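The reduce side of this example is small enough to show directly. The following is an illustrative sketch (the class name is our own) of a reducer that, for each year, scans the list of readings and keeps the maximum, producing exactly the (1949, 111) and (1950, 22) pairs shown above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text year, Iterable<IntWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable temperature : temperatures) {
            maxValue = Math.max(maxValue, temperature.get());  // e.g. max of [111, 78] is 111
        }
        context.write(year, new IntWritable(maxValue));        // emits (1949, 111) and (1950, 22)
    }
}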

MapReduce Input and Output Formats

The first program that you write in any programming language is typically a "Hello, World" application. In terms of Hadoop and MapReduce, the standard application that everyone writes is the Word Count application. The Word Count application counts the number of times each word in a large amount of text occurs. It is a perfect example to learn about MapReduce because the mapping step and reducing step are trivial, but introduce you to thinking in MapReduce. The following is a summary of the components in the Word Count application and their function.

FileInputFormat: We define a FileInputFormat to read all of the files in a specified directory (passed as the first argument to the MapReduce application) and pass those to a TextInputFormat (see Listing 1) for distribution to our mappers.

TextInputFormat: The default InputFormat for Hadoop is the TextInputFormat, which reads one line at a time and returns the byte offset as the key (LongWritable) and the line of text as the value (Text).

Word Count Mapper: This is a class that we write which tokenizes the single line of text passed to it by the InputFormat into words and then emits the word itself with a count of 1 to note that we saw this word.

    Combiner

While we don't need a combiner in a development environment, the combiner is an implementation of the reducer (described later in this article) that runs on the local node before passing the key/value pair to the reducer. Using combiners can dramatically improve performance, but you need to make sure that combining your results does not break your reducer: in order for the reducer to be used as a combiner, its operation must be associative, otherwise the values sent to the reducer will not produce the correct result.

Word Count Reducer: The word count reducer receives a map of every word and a list of all the counts for the number of times that the word was observed by the mappers. Without a combiner, the reducer would receive a word and a collection of 1s, but because we are going to use the reducer as a combiner, we will have a collection of numbers that will need to be added together.

    TextOutputFormat: In this example, we use the TextOutputFormat class and tell it that the keys will be Text and the values will be IntWritable.

FileOutputFormat: The TextOutputFormat sends its formatted output to a FileOutputFormat, which writes the results to an output directory that it creates. (A driver that wires all of these components together is sketched below.)
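Tying these components together is the job of a small driver class. The sketch below is illustrative only: the mapper and reducer names refer to the WordCount sketch shown earlier in this guide, the input and output paths come from the command line, and the combiner registration follows the combiner discussion above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // newer Hadoop versions use Job.getInstance(conf, ...)
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);     // one line of text per record
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);        // map-side partial sums (see combiner note above)
        job.setReducerClass(WordCountReducer.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory created by the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a driver is launched with the hadoop jar command in the same way as the bundled example used below.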

Sample Applications

Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of choice, for example /tmp/gutenberg. We have already done this above.

Restart the Hadoop cluster

Restart your Hadoop cluster if it is not running already.

    1:-hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

    Before we run the actual MapReduce job, we must first copy the files from our local file system to Hadoop HDFS.

1:- hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
2:- hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup  674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:/usr/local/hadoop$

Now that everything is prepared, we can finally run our MapReduce job on the Hadoop cluster. As I said above, we leverage the Hadoop Streaming API to help us pass data between our Map and Reduce code via STDIN and STDOUT.


Now, we actually run the WordCount example job:

1:- hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Note: Some people run the command above and get the following error message:

Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
at org.apache.hadoop.util.RunJar.main (RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file

    In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:

    1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

Example output of the previous command in the console:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286

Check if the result is successfully stored in the HDFS directory /user/hduser/gutenberg-output:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 2 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
-rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
hduser@ubuntu:/usr/local/hadoop$

Retrieve the job result from HDFS

To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command:

    1:-hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
hduser@ubuntu:/usr/local/hadoop$

You can see the screenshot here.

JobTracker Web Interface (MapReduce layer)

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

    By default, it is available at http://localhost:50030/.

    1:- Scheduling


    2:- Running job

    Killing Jobs

Unfortunately, sometimes a job goes awry after you have started it but it doesn't actually fail. It may take a long time to run or may even be stuck in an infinite loop. In (pseudo-)distributed mode you can manually kill a job using the command

    bin/hadoop job -kill job_id

where job_id is the job ID as given in the JobTracker web UI.

    Summary

We have discussed the key nodes and the roles they play within the Hadoop architecture. You have learned how to configure your cluster, as well as how to use some basic tools to monitor your cluster's overall health.


Overall, this chapter focuses on one-time tasks. Once you have formatted the NameNode for your cluster, you will (hopefully) never need to do so again. Likewise, you shouldn't keep altering the hadoop-site.xml configuration file for your cluster or reassigning daemons to nodes. In the next chapter, you will learn about the aspects of Hadoop you will be interacting with on a daily basis, such as managing files in HDFS. With this knowledge you will be able to begin writing your own MapReduce applications and realize the true potential that Hadoop has to offer.