Intel® Distribution for Apache Hadoop™ on Dell PowerEdge Servers
A Dell Technical White Paper
Armando Acosta
Hadoop Product Manager
Dell Revolutionary Cloud and Big Data Group
Kris Applegate
Solution Architect
Dell Solution Centers
Dave Jaffe, Ph.D.
Solution Architect
Dell Solution Centers
Rob Wilbert
Solution Architect
Dell Solution Centers
Executive Summary
This document details the deployment of Intel® Distribution for Apache Hadoop* software on the
PowerEdge R720XD. The intended audiences for this document are customers and system architects
looking for information on implementing Apache Hadoop clusters within their information technology
environment for Big Data analytics.
The reference configuration introduces all the high-level components, hardware, and software that
are included in the stack. Each high-level component is then described individually.
Dell developed this document to help streamline deployment, provide best practices and improve the
overall customer experience.
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN
TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS,
WITHOUT EXPRESS OR IMPLIED WARRANTIES OF ANY KIND.
© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Intel and Xeon are registered
trademarks of Intel Corp. Red Hat is a registered trademark of Red Hat Inc. Linux is a registered
trademark of Linus Torvalds. Other trademarks and trade names may be used in this document to
refer to either the entities claiming the marks and names or their products. Dell Inc. disclaims any
proprietary interest in trademarks and trade names other than its own.
July 2013
Table of Contents

1 Introduction
2 Dell Solution Centers
3 Dell’s Point of View on Big Data
4 Intel Distribution for Apache Hadoop
   Hadoop Use-Cases
   Intel’s Contributions to Open Source
5 Intel Hadoop Solution Software Components
   Server Roles
6 Best Practices for Running Intel Distribution for Apache Hadoop on Dell
   Node Count Recommendations
   Hardware Recommendations
      Monitoring
      Resiliency
      Performance
   Software Considerations
      Installation Environment Assumptions
      High Availability
   Installation Considerations
7 Testing
   HiBench
   Teragen / Terasort
   Tested Configuration
   Tuning and Optimization of Workloads
8 Conclusions
9 Resources
   Links
   Additional Whitepapers

Tables

Table 1. Recommended Cluster Sizes
Table 2. Software Revisions
Table 3. PowerEdge R720 Infrastructure Node As Tested Configuration
Table 4. PowerEdge R720XD Datanode As Tested Configuration
Table 5. Key Hadoop Configuration Parameters

Figures

Figure 1. Dell Solution Centers Locations
Figure 2. Big Data Demands
Figure 3. Intel Foundational Technologies for Hadoop Performance
Figure 4. Dell Big Data Cluster Logical Diagram
Figure 5. Ganglia Performance Monitor Tool (Included with IDH)
Figure 6. Cluster Network Diagram
Figure 7. Dell’s OpenManage Power Center
Figure 8. Dell R720XD models with 2.5” and 3.5” drives
Figure 9. The Role Assignment dropdown for HDFS roles
Figure 10. Mount points configured for the dfs.data.dir directories
Figure 11. Intel® Active Tuner
1 Introduction
Hadoop is an Apache open source project built and used by a global community of contributors, using the Java programming language. Hadoop’s architecture is designed to scale in a nearly linear fashion. By harnessing the power of this tool, many customers who previously would have had difficulty sorting through their complex data can now deliver value faster, gain deeper insight, and even develop new business models based on the speed and flexibility these analytics provide.
However, installing, configuring and running Hadoop is not trivial. There are different roles
and configurations that need to be deployed on various host computers. Designing,
deploying and optimizing the network layer to match Hadoop’s scalability requires
consideration for the type of workloads that will be running on the Hadoop cluster. These
issues are complicated by both the fast-moving pace of the core Hadoop project and the
challenges of managing a system designed to scale to thousands of nodes in a cluster.
Dell’s customer-centered approach is to create rapidly deployable and highly optimized
end-to-end Hadoop solutions running on highly scalable hardware. Dell listened to its
customers and partnered with Intel to design a Hadoop solution that is unique in the
marketplace, combining optimized hardware, software, and services to streamline
deployment and improve the customer experience.
Intel has created a high quality, controlled distribution of Hadoop and offers commercial
management software, updates, support and consulting services.
The Intel® Distribution for Apache Hadoop (IDH) software includes:
• The Intel® Manager for Apache Hadoop software to install, configure, monitor and administer the Apache Hadoop cluster
• Enhancements to HBase and Hive for improved query performance and end user experience
• Resource monitoring capability using Nagios and Ganglia in the Intel® Manager
• Superior security and performance through tightly integrated encryption and compression, authentication and access control
• Packaged Apache Hadoop ecosystem that includes HBase, Hive, and Apache Pig, among other tools
This solution provides a foundational platform for Intel to offer additional solutions as the Apache Hadoop ecosystem evolves and expands. Aside from the Apache Hadoop core technology (HDFS, MapReduce, etc.), Intel has designed additional capabilities to address specific customer needs for Big Data applications such as:
• Optimal installation and configuration of the Apache Hadoop cluster
• Monitoring, reporting, and alerting of the hardware and software components
• Providing job-level metrics for analyzing specific workloads deployed in the cluster
• Infrastructure configuration automation
In recent tests in the Dell Solution Center, the Intel® Distribution for Apache Hadoop
Release 2.4.1 was installed and tested on a cluster of Dell® PowerEdge® R720 servers,
resulting in a set of best practices for installing IDH on Dell clusters.
The next sections describe the role of the Dell Solution Centers and Dell’s point of view on
Big Data, followed by details of the IDH solution and IDH software components. Finally the
best practices developed by the Solution Center and the results of the IDH on Dell tests
are described.
2 Dell Solution Centers
The Dell Solution Centers (DSC) are a global network of connected labs that allow Dell to
help customers architect, validate and build solutions across Dell’s entire enterprise
portfolio. The Dell | Intel Cloud Acceleration Program (DICAP), a team within the Dell
Solution Centers, has the mission of providing customer engagements on the topics of
Cloud and Big Data.
With centers in every region, the DSC engages customers through informal 30-60 minute briefings, longer half-day architectural design sessions, and one- to two-week proof-of-concept tests that enable customers to “kick the tires” of Dell solutions prior to purchase.
Interested customers should engage with their Dell account team to access the services of
the DSC.
Figure 1. Dell Solution Centers Locations
Sao Paulo and Dubai coming in the second half of 2013
3 Dell’s Point Of View on Big Data
“Big Data” is a term often hyped in the IT press, and there are many different interpretations of what exactly it means. In Dell’s point of view, the methods and principles of Big Data aren’t new to the computer industry: Dell has been providing solutions such as High Performance Clustered Computing (HPCC), data warehouses, and traditional databases for years. What has changed is the scale at which such tools need to operate. Every new device in use in today’s society gathers more and more data, and the need to store, report on and analyze it is paramount. The term “big” can apply on a variety of different scales (see Figure 2):
• Volume – no longer in the realm of gigabytes, but rather terabytes or petabytes.
• Velocity – devices can now generate more data in a short time than can be ingested using traditional means.
• Variety – with the data types and schemas of the various datasets differing so much, being able to use a common datastore and to query across them provides tremendous value.
Figure 2. Big Data Demands
4 Intel® Distribution for Apache Hadoop
Dell continues to hear from customers about their Big Data challenges, specifically a need
for solutions that allow flexibility and choice while enabling key insights from their data.
Based on customer conversations and Dell’s experience in providing Hadoop solutions,
one size does not fit all. Each Hadoop distribution offers unique features and benefits. For
this very reason, Dell is introducing the partnership with Intel for the Intel® Distribution for
Apache Hadoop* software on the PowerEdge R720XD.
The Dell and Intel partnership is good for all customers that want value from their data.
Both companies share a common goal to help build a robust Apache Hadoop ecosystem
that is enterprise ready, allowing all customers to take advantage of this disruptive
technology. The partnership provides stability to the Apache Hadoop open source project;
both companies have long term strategies that will help drive the right capabilities and
features bringing the most value to customers.
Intel brings a unique value proposition for customers: the ability to enable an optimized solution from the CPU silicon all the way to the Hadoop distribution. Intel is the only vendor that can marry CPU technologies, SSD technology and 10Gb Ethernet to benefit Hadoop performance. The Intel® Distribution for Apache Hadoop software focuses on performance and security. The Dell and Intel strategy is to reinforce the Hadoop distribution by making it more enterprise ready and to provide a viable platform for big data workloads in all IT environments. The Intel® Distribution for Apache Hadoop software is especially suited for use cases where security, performance, and ease of data management are key needs.
Figure 3. Intel Foundational Technologies for Hadoop Performance
Hadoop Use-Cases
The Intel® Distribution for Apache Hadoop has been deployed in many different customer
scenarios. A few use cases that stand out are in healthcare, telecommunications and
smart-grid technology:
Healthcare – Customers use the massive database capabilities of IDH to store and process the human genome, evaluate pharmaceutical results and make patient care decisions. In genomic research, the fact that each human genome consists of 3.2 billion base-pairs with upwards of 4 million variants drives the need for a cost-effective, high-performance, scalable data processing engine. At the same time, the deep security enhancements IDH provides are of major importance given the healthcare industry’s strict compliance regulations.
Telecommunications – More and more mobile devices are getting into the hands of people all over the world. The billing systems for mobile providers need to be able to track call durations, text messages and data usage, and, more importantly, to report on this in near real-time. Hadoop is used instead of traditional massively parallel processing (MPP) and data-warehouse (DW) technologies due to its lower total cost of ownership (TCO) and inherent fault-tolerance.
Energy Smart-Grid – Mobile devices aren’t the only thing generating new data streams.
Smart power meters generate large streams of sensor data that can be used by energy and
utility companies to optimize service delivery. The ability to efficiently store this data is
allowing these companies to increase the rate of collection and provide additional, more
granular detail. Traditional databases are proving to be incapable of handling the ingestion
rate of this data at an affordable cost.
Intel’s Contributions to Open Source
As with many other open source projects, Hadoop’s power owes itself to the community that developed it. Contribution to open source projects, either directly or by enhancing the ecosystem, drives further adoption and deepens utilization. Intel has a long history of both contributing to core open source projects (the Linux kernel, Hadoop and KVM) and creating complementary projects. Two key programs to note in the context of Hadoop are:
• Project Rhino – This Intel-driven project enhances the data protection capabilities of Hadoop to address the security and compliance challenges around emerging use-cases. More details can be found at https://github.com/intel-hadoop/project-rhino/
• Project Panthera – This project’s goal is to provide full SQL support to help companies integrate Hadoop more deeply with their existing data analytics processes. More details can be found at https://github.com/intel-hadoop/project-panthera.
5 Intel Hadoop Solution Software Components
Hadoop Distributed File System (HDFS) – This is the clustered file system at the core of the Hadoop software stack. When data is stored on this file system it is automatically distributed for both resiliency and redundancy. In the default configuration, every block of a file is stored three times, on three different nodes. With Intel Hadoop, tunable parameters can be set to increase or decrease the file replication level as the file access frequency increases or decreases.
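As a minimal sketch (the paths below are hypothetical, not taken from the tested cluster), the replication level can be inspected and adjusted per path from the Hadoop command line; the cluster-wide default is the dfs.replication property in hdfs-site.xml:

    # Show the replication factor of an existing file (second column of the listing)
    hadoop fs -ls /user/demo/dataset.csv

    # Raise a frequently read directory to 5 replicas; -R recurses, -w waits for completion
    hadoop fs -setrep -R -w 5 /user/demo/hot-data

    # Drop it back to the cluster default of 3 once access cools off
    hadoop fs -setrep -R -w 3 /user/demo/hot-data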
MapReduce – This is the distributed batch-oriented parallel processing framework that
enables data analysis at a large scale. This framework is accessed by writing Java-based
MapReduce jobs that get executed against datasets in HDFS.
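For orientation only — the example jar name and location vary by distribution and version, and the HDFS paths here are illustrative — a packaged MapReduce job can be submitted from an edge node like this:

    # Stage some input data in HDFS
    hadoop fs -mkdir /user/demo/input
    hadoop fs -put /tmp/access.log /user/demo/input/

    # Submit the word-count example that ships with Apache Hadoop 1.x;
    # the jar location differs between installations
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/demo/input /user/demo/wordcount-out

    # Read the reducer output once the job completes
    hadoop fs -cat /user/demo/wordcount-out/part-* | head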
Hive – Hive makes accessing the power of MapReduce more familiar to existing database customers. It exposes the data that resides on HDFS as a SQL-like database. Standard SQL queries run against this data are translated into MapReduce by Hive and executed behind the scenes. With Intel Hadoop, Hive queries can run faster on data sets in HBase.
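A brief sketch of that workflow, assuming an illustrative table name, columns, and HDFS path (none of which come from the tested configuration): Hive is pointed at files already in HDFS, and the SQL is compiled into MapReduce jobs behind the scenes.

    # Define an external table over tab-delimited files in HDFS, then run an aggregate query
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
      ts STRING, user_id STRING, url STRING, bytes BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/demo/web_logs';

    SELECT user_id, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM web_logs
    GROUP BY user_id
    ORDER BY hits DESC
    LIMIT 10;"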
HBase – Some use-cases dictate the need for faster response times than a batch-based job through Hive or MapReduce can provide. For these use cases, HBase provides a non-relational, column-based, distributed database that resides directly on top of HDFS. This allows users to leverage HDFS’s massive scalability to serve emerging non-traditional databases. The HBase distribution in IDH is tuned to perform ad hoc queries via Hive faster for large datasets.
Server Roles
Name Node/JobTracker(s) – These nodes serve as control nodes for the HDFS, MapReduce, and HBase processes. For HDFS, they own the block map and directory tree for all the data on the cluster. With MapReduce, they own the Job Tracker daemon that handles job execution and monitoring. Lastly with HBase, these servers are responsible for running the monitoring processes as well as owning any metadata operations. Production environments should have a primary and at least one standby Name Node.
Data Node(s) – These are the nodes that hold the data as well as execute the MapReduce jobs. They are generally filled with large numbers of local disks, enabling the parallel processing and distributed storage features of Hadoop. The number of Data Nodes is dictated by the use case. Adding Data Nodes increases both performance and capacity simultaneously.
Edge Node(s) – These servers lie on the perimeter of the dedicated Hadoop network. They are where external users and business processes interact with the cluster. Oftentimes they will have a number of Network Interface Cards (NICs) attached to the Hadoop network as well as separate NICs attached to the enterprise’s production IT network. More Edge Nodes can be added as external access requirements increase.
Intel® Manager Node – This node is where the Intel Manager software is installed. It runs the configuration management processes, web server software, and performance monitoring software. In production installations, a dedicated server should fulfill this task. In smaller installations, such as the one employed by Dell in these tests, this role was shared with the Edge Node.
Figure 4. Dell Big Data Cluster Logical Diagram
Figure 5. Ganglia Performance Monitor Tool (Included with IDH)
6 Best Practices for Running Intel Distribution for Apache Hadoop on Dell
Node Count Recommendations
Dell recognizes that use-cases for Hadoop range from small development clusters all the way through large multi-petabyte production installations. Dell has a Professional Services team that sizes Hadoop clusters for a customer’s particular use. As a starting point, three cluster configurations can be defined for typical use:
• Minimum Development Cluster – This is targeted at functional testing and may even be built from existing equipment. However, the performance of these types of clusters can be significantly lower, as they don’t benefit from the highly distributed nature of HDFS.
• Recommended Small Cluster – This is a good starting point for customers taking the initial steps into running IDH in production. It provides some of the layers of resiliency that are expected in today’s production IT world.
• Recommended Production Cluster – This configuration provides all the available options for resiliency at both the hardware and software layers. In addition, it allows for an adequate number of data nodes to demonstrate the performance benefits of distributed storage and parallel computing.
Table 1. Recommended Cluster Sizes

                          Minimum Development   Recommended Small   Recommended Production
                          Cluster               Cluster             Cluster
Name Node(s)²             1                     2                   2
Job Tracker Node(s)²      1                     2                   2
Edge Node(s)              0¹                    1                   1
Data Node(s)              3                     5                   15
Intel Manager Node        0¹                    1                   1
1 GbE Switches            1                     1                   2
10 GbE Switches           0                     2                   2
Rack Units                9U                    20U                 42U

¹ In this case a single node serves as the Name, Job Tracker, Edge and Intel Manager Node.
² In some cases a single server can serve as both the Name Node and Job Tracker.
Figure 6. Cluster Network Diagram
Hardware Recommendations
Dell’s complete portfolio really shines when building out comprehensive solutions. From the servers to the switches, and even down to the racks and monitoring tools, the value of deploying on Dell is readily apparent.
Monitoring
Using the Dell Remote Access Controllers (DRACs) in the servers, Dell customers can identify increases in power consumption and temperature as they exercise the disks and CPUs. One great tool to aid with this is Dell’s OpenManage Power Center. This tool uses the Intel Node Manager technology, exposed through the Dell Remote Access Controller (DRAC), to provide metrics and trigger alert events based on customer criteria.
Figure 7. Dell’s OpenManage Power Center
Resiliency
In production clusters it is imperative to keep an eye toward mitigating as many points of failure as possible. However, it is important to keep in mind that Hadoop (both through HDFS and MapReduce) is meant to be natively tolerant of failures and will take care of much of the needed underlying work. That said, when investing in building a robust and resilient configuration, here are the key areas to focus on:
Switches – Multiple stacked Force10 switches should be used for high availability. Force10 S60 1GbE switches utilize stacking modules, which provide easier switch management and faster inter-switch communication. On the Force10 S4810s there is the option of either stacking via the 10 or 40 GbE ports (FW 8.3.12+) or implementing Virtual Link Trunking if you plan to scale beyond the stacking limitations (see switch documentation for configuration maximums).
NICs – Either two single-port or two dual-port NICs are recommended in the administration servers to guard against PCI-E slot failures. This is not as crucial on data nodes due to data node redundancy.
Disks – RAID is recommended only in the administration servers such as the Name Node. In the Data Nodes it is strongly recommended to configure as many separate disks as possible (no RAID). The flexibility of the PowerEdge R720XD really shines here since it can hold either (12) 3.5” drives or (24) 2.5” drives.
Figure 8. Dell R720XD models with 2.5” and 3.5” drives
Performance
Performance optimization is a matter that varies greatly from customer to customer. There
are a few principles that should be considered in order to optimize cluster performance.
Network – While 10 GbE isn’t required, multiple bonded NICs of the fastest speed possible are strongly recommended for the data network. Workloads vary in whether or not they can truly benefit from a fast network, but with the prevalence of 10 GbE, it is wise to invest ahead of the curve. You’ll also want enterprise-grade switches with deep per-port packet buffers in order to handle the volume and density of traffic Hadoop can generate. For 1 GbE, Dell Force10 S60 switches work well; at 10 GbE, Dell Force10 S4810s are optimal.
Disks – A key principle of performance tuning is to eliminate input/output (IO) starvation at the CPU layer and contention at the disk level. From this comes the initial recommendation of a 1:1 ratio of disk spindles to physical processor cores (with hyper-threading counting as half of one physical core for this purpose). The correct choice of disks and processors depends entirely on the workload, which can vary from heavily storage-centric, with massive disks and few processors, to heavily processor-centric, with many cores and PCI-E SSDs. The Dell Professional Services team can provide consultation and assessment to help customers achieve the proper balance. The Dell PowerEdge R720XD provides excellent flexibility with regards to drive and socket configurations.
Memory – Few Hadoop use-cases will be memory constrained, but administration servers should have sufficient memory for index caching (128GB for a robust configuration). For the data nodes, while there are emerging use-cases that call for high amounts of memory, Hadoop customer engagements in the Dell Solution Centers have shown that 64GB is a good initial target.
CPUs – As mentioned above, the use-case will determine the correct balance of CPU, memory, and disk speed. In performance-oriented use-cases, the most cores (balancing out spindle count if not using SSDs) and the highest possible frequency CPUs are recommended. However, if you are more interested in storage capacity, you could look at the more energy-efficient Intel Xeon E5-2600L series processors.
Software Considerations
Installation Environment Assumptions
Updated Operating System – The selected OS should have appropriate updates applied prior to IDH installation. The IDH documentation lists supported OS versions as well as required updates.
Package Management – As part of the installation an existing OS package repository
needs to be referenced. Additionally, a new repo for IDH software needs to be created. In
some cases (Red Hat Enterprise Linux) this may mean registering the OS with the proper
credentials.
DNS – Forward and reverse name resolution are required for installation. Host-to-host communication is handled by hostname, so this is imperative. Name resolution can be accomplished via /etc/hosts or a DNS server.
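A minimal sketch of the /etc/hosts approach (hostnames and addresses are invented for illustration), along with a quick check of forward and reverse resolution on each node:

    # Entries replicated to /etc/hosts on every node in the cluster
    cat >> /etc/hosts <<'EOF'
    192.168.10.11  namenode1.hadoop.local   namenode1
    192.168.10.12  jobtracker1.hadoop.local jobtracker1
    192.168.10.21  datanode1.hadoop.local   datanode1
    192.168.10.22  datanode2.hadoop.local   datanode2
    EOF

    # Verify forward and reverse lookups before launching the IDH installer
    getent hosts datanode1          # forward: name -> address
    getent hosts 192.168.10.21     # reverse: address -> name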
NIC Bonding – In order to get as much bandwidth and resiliency as possible, Dell recommends implementing bonding on the NICs. In these tests, mode 6 (balance-alb) was used.
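The following is a sketch of that configuration on Red Hat Enterprise Linux 6; the interface names (em1/em2), addresses, and file contents are illustrative and should be adapted from the OS and IDH documentation:

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded logical interface
    cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
    DEVICE=bond0
    IPADDR=192.168.10.21
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=6 miimon=100"
    EOF

    # Each physical slave interface is pointed at the bond
    cat > /etc/sysconfig/network-scripts/ifcfg-em1 <<'EOF'
    DEVICE=em1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none
    EOF

    # Repeat for em2, then restart networking and confirm the bonding mode
    service network restart
    cat /proc/net/bonding/bond0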
Production Network Connectivity – The Edge Node needs to be connected to the user’s existing network in order to facilitate access to the cluster. The speed of this link should meet the needs of the inbound data ingestion plans (both in number of users/processes and in volume of data).
High Availability
Production Hadoop workloads require a high degree of resiliency to achieve desired uptime goals. In IDH 2.4.1, High Availability (HA) is handled in an Active/Passive manner using a number of components:
• Distributed Replicated Block Device (DRBD) – allows a logical device to be mirrored between two disparate systems
• Pacemaker – a Cluster Resource Management (CRM) framework that starts, stops, monitors and migrates resources automatically
• Corosync – a messaging framework, used by Pacemaker, for internode communication
These tools, when used together, provide layers of redundancy for both the HDFS NameNode service and the MapReduce JobTracker. In order to enable HA, additional hardware may be required in the name nodes, including extra NICs, more memory, and additional disks. While failover of both the NameNode and JobTracker HA services is completely automatic, in-flight jobs must be resubmitted once the failover completes.
High availability also requires some additional network configuration. Virtual hostnames and IP addresses for both the NameNode and JobTracker HA functions must be identified and recorded in all /etc/hosts files or DNS tables.
It is worth noting that the IDH 2.4 release is based on the 1.x Apache Hadoop open source releases, which had no inherent HA option; Intel’s distribution adds this capability.
Installation Considerations
Role Assignments
During the installation, the setup wizard prompts for specific role assignments of the cluster servers. It is a good idea to use the “Edit Roles” button on the last page of the wizard to double-check that each of the parameters was set correctly, as shown in Figure 9.
Figure 9. The Role Assignment dropdown for HDFS roles
Mount Points
Mount points are key to properly configuring an optimized cluster. It is always best practice to follow the installation guide and, prior to starting HDFS or any of the other services, to make sure that the values for dfs.data.dir (Figure 10) and mapred.data.dir point to the appropriate mount points. In the case below, one mount point is allocated per physical spindle.
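A brief sketch of what that looks like on a data node (device names, labels, and directory names are illustrative; the Intel Manager normally writes the final configuration):

    # One filesystem per physical data spindle, mounted individually with no RAID
    mkfs.ext4 -L data01 /dev/sdb
    mkdir -p /data01
    mount LABEL=data01 /data01
    # ...repeat for /data02 through /data12 on a 12-drive R720XD

    # dfs.data.dir (set via the Intel Manager or hdfs-site.xml) then lists one
    # directory per spindle, comma separated:
    #   /data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn,...,/data12/dfs/dn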
Figure 10. Mount points configured for the dfs.data.dir directories
7 Testing
HiBench
HiBench is a Hadoop benchmark framework that consists of nine workloads representative of common Hadoop use. These include micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks, and data analytics benchmarks. For this paper, the most well-known subset of the HiBench suite, the Teragen / Terasort benchmark, was employed to test system IO.
Teragen / Terasort
These two HDFS / MapReduce benchmarks are used in conjunction with each other to stress Hadoop systems and provide valuable metrics with regards to network, disk and CPU utilization. By starting with these as a baseline, Hadoop administrators can tune Hadoop’s wide variety of parameters to get the desired performance. Teragen starts by generating flat text files containing pseudo-random data, which Terasort then sorts. This type of sort / shuffle exercise is similar to what customers do over and over as they manipulate data through MapReduce jobs.
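As an illustration of how such a run is typically launched (the example jar path, row count, and HDFS directories are assumptions, not the exact commands used in these tests):

    # Generate one billion 100-byte rows (about 100 GB) of synthetic input data
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teragen 1000000000 /benchmarks/terasort-input

    # Sort the generated data; this phase stresses disk, network and CPU together
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output

    # Optionally confirm that the output is globally sorted
    hadoop jar /usr/lib/hadoop/hadoop-examples.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-report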
Tested Configuration
In these tests a small Hadoop cluster was employed, as recommended in Table 1. The specific software revisions used in the tests are shown in Table 2. The PowerEdge R720 and R720XD hardware configurations are shown in Table 3 and Table 4. The hardware listed should be used as initial guidance only; additional configurations are possible and will likely be required, as each customer’s environment and use-case is unique.
Table 2. Software Revisions

Component                               Revision
Red Hat Enterprise Linux                6.4
Intel Distribution for Apache Hadoop    2.4.1 (Build 16962)
Apache Hadoop (base for IDH)            1.0.3
HBase                                   0.94.1
Hive                                    0.9.0
ZooKeeper                               3.4.5
HiBench                                 2.2
Table 3. PowerEdge R720 Infrastructure Node As Tested Configuration
Component Detail
Height 2 Rack Units (3.5”)
Processor 2x Intel Xeon E5-2650 2 GHz 8-core procs
Memory 128 GB
Disk 6x 600 GB 15K SAS Drives
Network 4x 1GbE LOMs, 2x 10GbE NICs
RAID Controller PowerEdge RAID Controller H710 (PERC)
Management Card Integrated Dell Remote Access Controller (iDRAC)
Table 4. PowerEdge R720XD Datanode As Tested Configuration
Component Detail
Height 2 Rack Units (3.5”)
Processor 2x Intel Xeon E5-2667 2.9 GHz 6-core procs
Memory 64 GB
Disk 24x 500GB 7200 RPM Nearline SAS drives
Network 4x 1GbE LOMs, 2x 10GbE NICs
RAID Controller PowerEdge RAID Controller H710 (PERC)
Management Card Integrated Dell Remote Access Controller (iDRAC)
Tuning and Optimization of Workloads
The cluster configuration variables used in these tests (Table 5) are simply a starting point. Parameters like dfs.block.size are highly contingent on the type of data being stored and the use-case thereof. A Dell Professional Services engagement is recommended to achieve configurations optimized for the user’s workload.
Table 5. Key Hadoop Configuration Parameters
Name                                            Value
dfs.block.size                                  134217728
ipc.server.tcpnodelay                           FALSE
ipc.client.tcpnodelay                           FALSE
io.sort.factor                                  100
io.sort.mb                                      400
io.sort.spill.percent                           0.8
io.sort.record.percent                          0.05
mapred.child.java.opts                          1024m
mapreduce.tasktracker.outofband.heartbeat       TRUE
mapred.job.reuse.jvm.num.tasks                  1
mapred.min.split.size                           134217728
mapred.reduce.parallel.copies                   20
mapred.reduce.tasks.speculative.execution       TRUE
mapred.reduce.tasks                             30 * # of Task Trackers
mapred.map.tasks                                20 * # of Task Trackers
mapred.compress.map.output                      TRUE
tasktracker.http.threads                        60
io.buffer.file.size                             4096
io.bytes.per.checksum                           4096
mapred.task.timeout                             1800000
mapred.tasktracker.map.tasks.maximum            30
mapred.tasktracker.reduce.tasks.maximum         20
Intel® Active Tuner
As part of IDH, Intel provides a unique tool that can help users optimize configuration parameters. A small MapReduce job is created and uploaded along with its command-line parameters. The Active Tuner runs it for a pre-determined number of iterations while adjusting the known performance-enhancing parameters to arrive at an optimal configuration tuned for that workload.
Figure 11. Intel® Active Tuner
8 Conclusions
For enterprises looking to take advantage of the wealth of available data, the Intel® Distribution for Apache Hadoop running on Dell PowerEdge server clusters provides a robust platform for Big Data applications. The Intel distribution stands out from others with its high availability features and the Intel Active Tuner tool. IDH also takes key steps into emerging areas of interest for customers around encryption and security of Big Data. With a proven track record of supporting large genomics and telecommunications customers, IDH is an attractive Hadoop solution offering. Deploying the Intel® Distribution for Apache Hadoop on Dell’s award-winning hardware results in a high-quality, cost-effective Hadoop platform. This Hadoop solution from Dell and Intel benefits all types of customers, from those that are just starting their investigation into Hadoop technology to those who are ready to build out large clusters for petabyte-scale applications.
9 Resources
Links
Intel® Distribution for Apache Hadoop – http://hadoop.intel.com
Intel HiBench – https://github.com/intel-hadoop/HiBench/
Project Rhino – https://github.com/intel-hadoop/project-rhino/
Project Panthera – https://github.com/intel-hadoop/project-panthera/
Reference Architecture for Intel Distribution for Apache Hadoop – http://hadoop.intel.com/pdfs/IntelDistributionReferenceArchitecture.pdf
Security without compromising performance – https://hadoop.intel.com/pdfs/IntelEncryptionforHadoopSolutionBrief.pdf

Additional Whitepapers
Genomic Analytics – Next Bio – http://hadoop.intel.com/pdfs/IntelNextBioCaseStudy.pdf
Smart Energy Analytics – Pecan Street – http://hadoop.intel.com/pdfs/smart-energy-analytics-pecan-street.pdf
Telco Analytics – China Mobile – http://hadoop.intel.com/pdfs/IntelChinaMobileCaseStudy.pdf
Healthy City Analytics – China – http://hadoop.intel.com/pdfs/IntelChinaHealthyCityAnalyticsCaseStudy.pdf
Smart City Video Analytics – Shanghai – http://hadoop.intel.com/pdfs/IntelSmartCityVideoAnalyticsShanghaiIdealCaseStudy.pdf