Introducing Hadoop on Azure
M Sheik Uduman Ali, Technical Architect, Aditi Technologies
Instead of reinventing the wheel, Microsoft has made a strong and brilliant move by integrating Hadoop into its blockbuster cloud computing PaaS stack. Of course, LINQ to HPC was embraced by many .NET developers; still, offering a Hadoop distribution for Windows is the safer move. This paper evaluates the early preview of Hadoop on Azure and covers the basics of using it. It would be helpful to read about MapReduce and Hadoop topology before learning about Hadoop on Azure.
For comments or questions regarding the content of this paper, please contact
Sunny Neogi (sunnyn@aditi.com) or
Arun Kumar (arung@aditi.com)
Why do we need Hadoop?
The simple answer to this question is "big data analysis". Some examples of big data analysis are:
Calculating consumer purchasing trends for particular product categories from big data growing at a rate of 1 million transactions per hour
Web application log analysis
Internet search indexing
Social network data analysis
Since relational databases and their ecosystem were designed around a "scale-up" strategy with centralized data processing, they are not well suited to the data warehousing space. Moreover, the data persisted by modern applications is a mix of relational, structured and non-structured content. Hence, we need a much more powerful system. Hadoop is one of the most successful open source platforms based on the MapReduce principle, which in turn follows the "making big by small" philosophy.
A big data processing task is called a "job" since it may be run frequently, periodically, once in a while or only once; it is not part of day-to-day business processing.
What is MapReduce?
Basically, the input data is processed on "n" small physical nodes in a clustered environment in two different phases:
Map: The input data is grouped into <k1, v1> key-value pairs. For example, if the input data resides in one or more files, then k1 would be the file name and v1 the file content. Hence, the map phase receives a list of <k1, v1> pairs and splits it across the available map nodes in the cluster. On every node, the mapping function mostly performs "filtering and transformation" and produces <k2, v2> pairs. For example, to count the number of occurrences of each word in a given set of documents, <filename, content> is the <k1, v1> pair, and the nodes in the mapping phase count the words in the given v1. This generates output such as <"aditi", 1> as a <k2, v2> pair for every occurrence of the word "aditi" in a document. Hence, the output of the mapping phase is a list of <k2, v2> pairs; for example, it contains many <"aditi", 1> pairs.
Reduce: All <k2, v2> pairs are aggregated into <k2, list(v2)> pairs. In the word count example, a node in the Hadoop cluster may produce <"aditi", List(1, 1, 1, 1)> from the documents on different nodes. Every list(v2) for a k2 is passed to a node for reducing. The output is a list of <k3, v3> pairs. For example, if a node receives "aditi" as k2, it accumulates the list as 1+1+1+1 and produces 4 as v3; here, k3 is again "aditi". Each reducer node does the same for the other words.
The <k2, v2> aggregation is actually performed by a component called a "combiner". For now, let us keep the focus on the mapper and the reducer. The overall flow is shown in the figure below (figure 1).
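To make the flow concrete, here is a minimal, in-memory sketch of the word count example in C#. This is not Hadoop code; it only mimics the map, group (combiner/shuffle) and reduce phases on a single machine, and the document names and contents are made up for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

class WordCountSketch
{
    // Map: <k1, v1> = <file name, content> -> list of <k2, v2> = <word, 1>
    static IEnumerable<KeyValuePair<string, int>> Map(string fileName, string content)
    {
        foreach (var word in content.Split(new[] { ' ', '\t', '\n' },
                                           StringSplitOptions.RemoveEmptyEntries))
            yield return new KeyValuePair<string, int>(word.ToLower(), 1);
    }

    // Reduce: <k2, list(v2)> = <word, List(1, 1, ...)> -> <k3, v3> = <word, total>
    static KeyValuePair<string, int> Reduce(string word, IEnumerable<int> counts)
    {
        return new KeyValuePair<string, int>(word, counts.Sum());
    }

    static void Main()
    {
        // Illustrative input: <k1, v1> pairs of <file name, file content>.
        var documents = new Dictionary<string, string>
        {
            { "doc1.txt", "aditi builds on azure" },
            { "doc2.txt", "aditi aditi hadoop" }
        };

        var mapped = documents.SelectMany(d => Map(d.Key, d.Value));

        // The grouping a combiner/shuffle performs: <k2, list(v2)>.
        var grouped = mapped.GroupBy(kv => kv.Key, kv => kv.Value);

        foreach (var group in grouped)
        {
            var result = Reduce(group.Key, group);
            Console.WriteLine("{0}\t{1}", result.Key, result.Value); // e.g. "aditi  3"
        }
    }
}

In a real Hadoop cluster the map and reduce steps run on different nodes, and the grouping happens during the shuffle between them.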
Hadoop Cluster
A Hadoop cluster is an infrastructure with many physical nodes, where some are configured for "mapping" and some for "reducing", along with administrative, tracking and data persistence nodes called the "Name Node", "Job Tracker", "Task Tracker" and "Data Node". This is a master/slave architecture: the "Name Node" and "Job Tracker" are masters and the remaining nodes are slaves. This is shown in figure 2.
In order to handle big data storage and processing, Hadoop uses HDFS, a distributed file system that can handle even 100 TB of content as a single file.
Hadoop Ecosystem on Azure
Since every task is run as a "job", you can rent the required nodes for your job, use them and release them. Hence, the elastic computing and data storage (blob and table storage) in Azure is definitely a good choice for running your Hadoop job. The home land of Hadoop is Java, so at this early stage on Azure the Hadoop Java SDK is one of the good options for writing your job. In addition, "Hadoop on Azure" leverages the elasticity of Azure storage through Hadoop streaming, by which you can write your job in C# or F# and use Azure blob storage for data persistence (this scheme is called ASV). The figure below shows the Hadoop ecosystem on Azure (figure 3).
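To illustrate, here is a hedged sketch of what such a streaming job could look like in C#. A streaming executable reads input lines from standard input and writes <key TAB value> pairs to standard output; Hadoop handles the distribution and sorting. The program names and container paths below are illustrative assumptions, not the exact Hadoop on Azure setup.

// WordMapper.cs - compiled to WordMapper.exe and passed as the mapper.
using System;

class WordMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            // Emit <word, 1> for every word on the input line.
            foreach (var word in line.Split(new[] { ' ', '\t' },
                                            StringSplitOptions.RemoveEmptyEntries))
                Console.WriteLine("{0}\t1", word.ToLower());
        }
    }
}

// WordReducer.cs - compiled separately to WordReducer.exe and passed as the
// reducer. Hadoop delivers the mapper output sorted by key, so equal words
// arrive on consecutive lines and can be totalled in a single pass.
using System;

class WordReducer
{
    static void Main()
    {
        string currentWord = null;
        int count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != currentWord)
            {
                if (currentWord != null)
                    Console.WriteLine("{0}\t{1}", currentWord, count);
                currentWord = parts[0];
                count = 0;
            }
            count += int.Parse(parts[1]);
        }
        if (currentWord != null)
            Console.WriteLine("{0}\t{1}", currentWord, count);
    }
}

A typical Hadoop streaming invocation would then reference these executables via -mapper WordMapper.exe and -reducer WordReducer.exe, and on Azure could point its -input and -output at blob storage through the ASV scheme, for example asv://mycontainer/input (the container name here is hypothetical).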
To create directories, get and put files, and issue data processing commands on HDFS/ASV, Azure provides an interactive JavaScript console. (In the actual Hadoop distribution, Java is the main interface for this.) In addition, Azure supports Hive (a SQL-like query language for Hadoop) and Pig Latin (a high-level data processing language).
The Web Portal for Hadoop on Azure
www.hadooponazure.com is the management portal where you create, release and renew clusters for your job. The following are the steps you need to perform to run a job:
1. Develop the mapping and reducing functions either in Java or in your preferred platform. On non-Windows platforms, these could be shell scripts, Ruby, PHP, Python, etc. On Azure, you can write the code in .NET, as in the streaming sketch above.
2. Decide where the input data and output results of the job will be managed: either in HDFS or in Azure Blob storage.
3. Request a cluster for the job in the portal.
4. Specify all the parameters for the job, including the executables and the input and output details.
5. Run the job and get the output.
6. Release the cluster.
In this paper, let us look at step 3: how to create a cluster for a job.
Requesting a new Cluster
After you enter the portal, you need to provide the following details for the new cluster environment, as shown in the figure below (figure 4):
DNS name (<dnsname>.cloudapp.net)
Cluster size - similar to Azure role sizes; for example, 4 nodes + 2 TB disk space = small, 32 nodes + 16 TB = extra large
Cluster login information
After entering these details, press the Request Cluster button. This will create the cluster environment for your job. The screen shows the progress of provisioning the new nodes for the cluster, as in the figure below (figure 5):
After the provisioning, you will see a screen as shown below (figure 6):
You can now start creating a new job; if you want to access the environment, you can use either the "Interactive Console" or "Remote Desktop".
When you click on New Job, you will see the screen below (figure 7). The figure shows a Hadoop Streaming based job.
——————————————————————————————————
About the Author:
M Sheik Uduman Ali is a cloud architect at Aditi who is involved in its cloud practice. He is a blogger and has published an online book about "Domain Specific Languages in .NET".
ABOUT ADITI
Aditi helps product companies, web businesses and enterprises leverage the power of cloud, e-social and mobile to drive competitive advantage. We are the Microsoft cloud partner of the year, one of the top 3 Platform-as-a-Service solution providers globally, and one of the top 5 Microsoft technology partners in the US. We are passionate about emerging technologies and are focused on custom development.