
With so much noise around Hadoop and big data, it can be difficult to find real answers on what the everyday uses are that make Hadoop a viable component of the modern enterprise data architecture. By understanding the uses, benefits and risks of Hadoop in the enterprise, business and information technology professionals can develop a plan for gaining acceptance of implementing it in their organization.

A MetaScale Whitepaper

Making a Case for Hadoop in Your Organization

Author: Andrew McNalis


Introduction

Hadoop and big data have been all over the press lately. Underneath the hype, Hadoop does look like an attractive, low-cost solution that your company could leverage. So you would like to get approval to implement Hadoop within your company or organization, but you need to sell it to skeptical upper management. This whitepaper provides tools to do just that by covering uses of Hadoop in the organization, the benefits of adopting it, and the risks and risk responses. Additionally, a companion piece outlines points you can use in a plan for selling Hadoop within your organization. Topics to be covered:

Uses for Hadoop in the Enterprise

How Hadoop Lowers Costs and Reduces Processing Times

Reliability of Hadoop in the Enterprise

Benefits to Implementing Hadoop

Impedances, Barriers and Road Blocks to Implementing Hadoop in the Enterprise

Risk Management of Hadoop in the Enterprise

The Plan (see companion piece)

Through candid discussion of these topics, anyone who needs to obtain executive, management or funding approval should be able to make a compelling business case for implementing Hadoop in the enterprise. Both business and information technology professionals will gain valuable insight into what Hadoop can do in an enterprise and how to use that insight to win acceptance.

Uses for Hadoop in the Enterprise

Like most IT professionals, you probably receive some kind of promotional email about big data on a daily basis. You might find yourself asking: So, what’s it really all about? How can big data and Hadoop be used in the enterprise, specifically in my organization?


Let’s start with a couple of definitions.

Big Data: In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications (source: Wikipedia).

Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications (source: Wikipedia).

Unstructured vs. Structured Data

Many of the published use cases for Hadoop involve analysis or processing of unstructured data. In most cases, traditional relational database management systems (RDBMS) are not effective tools for processing unstructured data, especially in large volumes. Processing large datasets of unstructured data is an area where Hadoop excels. That does not mean it cannot process structured data, however; as later topics in this paper show, many of the same characteristics enable Hadoop to process structured data as well.

Processing Big Data

The traditional use case for Hadoop is storing and processing large, complex datasets. This was the genesis of Hadoop, and the functionality has been proven many times over.

Enterprise Data Hub

Many organizations have targeted data warehouse platforms as their enterprise data hub. In the past, these platforms were the best choice available, but they come with an expensive initial price tag along with ongoing maintenance and subscription costs. Depending on the IT budget, the size of the data warehouse may also be constrained by cost, forcing transformation of the data, elimination of some data elements, and/or purging of old data to save space. With Hadoop, these measures can be avoided, enabling an organization to keep all raw operational data, potentially forever.


(Figure: Hadoop as an Enterprise Data Hub)

Data Archive (Where the Data Is Accessible)

Many organizations have retention periods for data, driven by compliance and legal requirements. In other cases, someone in the organization simply thinks the data might be valuable in the future, so they take measures to keep it. But where is that data usually kept? In Excel spreadsheets; in Access, SQL Server and MySQL databases; on corporate shared drives; or on individuals' hard drives. Each of these media has its own problems, but in general, accessibility by the greater organization is the common one. Additionally, space constraints drive behaviors and actions that may not be the best course for the data.


In many cases, application-specific data is kept on physical magnetic tape. Problems with tape include accessibility, the location of the tapes, the time it takes to retrieve the data, and the reliability of the tape media. Using Hadoop as a low-cost storage solution enables the organization to keep data for compliance and other purposes, and to make that data available at any time.

Analytics and Deep Analytics

Hadoop can be used to analyze data against complex criteria. Deep analytics refers to running such analysis over very large quantities of data, in the hundreds-of-terabytes to petabyte range. Available tools include MapReduce (typically written in Java, or in languages such as Ruby via Hadoop Streaming), Pig, Hive, Hue, and a variety of commercial off-the-shelf BI tools. The open-source R statistical package can also be installed to leverage the parallel computing power of Hadoop.

Operational Processing

Operational processing is a bit of a departure from the type of work usually targeted at a Hadoop environment. While Hadoop got its start processing large volumes of data from web sites, it also excels at traditional IT batch processing. Examples follow.

ETL Processes

Extract, Transform and Load (ETL): many companies spend large sums on computing equipment and software to meet their ETL needs. Once the technical infrastructure is in place, the organization must develop scripts that read the inputs, perform the transformations and create the outputs; organizations that are not using formal ETL tools have instead written these processes in programming languages. Pain points with ETL environments include:

capacity issues (too much to process and not enough computing resources);

processing congestion and data latency;

up front and ongoing expense for this single use environment;

complex business logic coded into the proprietary language of the ETL tool;

locking the organization into this expensive technology solution for the foreseeable future;


data lineage (tracking back through transformations of already-transformed data).

Hadoop addresses these problems inherently.

Ways Hadoop can ease ETL pain points include:

ability to keep growing your cluster because of Hadoop’s scalability and low cost hardware;

mitigated processing congestion and data latency issues by leveraging the parallelization of Hadoop;

reduced costs with commodity hardware, open source software and ease of scaling the environment;

code in the easy-to-use Pig language (a short sketch follows the figure below).

(Figure: Transformation of ETL to ELTx (T-T-T) with Hadoop)
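To make the ELT pattern concrete, here is a minimal Pig sketch (all paths, field names and delimiters are hypothetical, not from any particular installation). The raw extract is loaded as-is, the transformations are expressed as simple relational steps, and only the derived output is written, so the atomic source data and its lineage are preserved:

    -- Raw data stays in Hadoop untouched; every transformation reads
    -- from the atomic source, which keeps data lineage clear.
    raw_sales = LOAD '/data/raw/pos_sales' USING PigStorage('|')
                AS (store_id:int, sku:chararray, sale_date:chararray, amount:double);
    stores    = LOAD '/data/raw/store_master' USING PigStorage('|')
                AS (store_id:int, region:chararray);

    -- Transform: keep current-year sales only (illustrative filter).
    recent    = FILTER raw_sales BY sale_date >= '2013-01-01';

    -- Enrich with region, then aggregate per region.
    joined    = JOIN recent BY store_id, stores BY store_id;
    by_region = GROUP joined BY stores::region;
    totals    = FOREACH by_region GENERATE group AS region,
                    SUM(joined.recent::amount) AS total_sales;

    -- The "load" step of ELT: write the derived set; raw data is never modified.
    STORE totals INTO '/data/derived/region_sales_2013' USING PigStorage('|');

Because each run starts from the same raw files, re-deriving an output after a logic change is simply a matter of rerunning the script.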

Mainframe and Distributed Batch Processing

Hadoop excels at batch processing. By its very nature, Hadoop processes files sequentially, just like all batch processing systems: whenever data is read in Hadoop, a file is opened and read sequentially from start to end. The Hadoop framework handles the parallelization "under the covers" and insulates developers and users from that layer. One of the major benefits of doing batch processing on Hadoop is speed: because of the parallelization, batch processes can run ten times faster than on a single-threaded server or the mainframe.

High Disk and CPU Consumption Applications and Grid Computing

Because of the multiple machines and parallelization within Hadoop, and the local disk on each of the data nodes, Hadoop is an excellent environment for solving big data and heavy central processing unit (CPU) consumption problems. Once again, because of the nature of the Hadoop ecosystem, coders are insulated from the specifics of parallel processing and can instead focus on functionality.
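As a short illustration of that insulation, consider the following minimal Pig sketch (the log path and schema are hypothetical). Nothing in it mentions tasks, threads or data nodes; the framework alone decides how the work is split across the cluster:

    -- Reads like a simple sequential batch job, but each step runs as
    -- parallel map and reduce tasks spread across the data nodes.
    logs   = LOAD '/data/raw/weblogs' USING PigStorage('\t')
             AS (ts:chararray, url:chararray, bytes:long);
    by_url = GROUP logs BY url;
    counts = FOREACH by_url GENERATE group AS url,
                 COUNT(logs) AS hits, SUM(logs.bytes) AS total_bytes;
    STORE counts INTO '/data/derived/url_traffic';

The same script runs unchanged whether the cluster has five data nodes or five hundred; only the elapsed time differs.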

Lower Costs and Reduced Processing Times

For new projects within your organization, Hadoop is a less expensive alternative to traditional computing equipment and software. Targeting traditional platforms (mainframes, robust servers, purchased software, purchased distributed relational databases) is costly: in addition to the initial purchase price, most come with ongoing maintenance and subscription costs. With open-source Hadoop software running on stripped-down commodity hardware, expenses are kept to a minimum.

Hadoop is also being used to convert legacy applications to a lower-cost, open-source parallel computing model. There have been many successes converting COBOL and IBM BAL programs to Pig with Java-based UDFs (user-defined functions), where the UDFs hold the logic that is too complex for Pig. Converting mainframe and distributed batch processes to Hadoop reduces MIPS (millions of instructions per second) consumption, which leads to a smaller mainframe footprint and lower mainframe operating and licensing costs (licensing is often tied to MIPS).
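The conversion pattern might look like the following minimal Pig sketch (the jar name, UDF class and schema are hypothetical placeholders standing in for rewritten COBOL rules):

    -- Complex legacy logic is rewritten once as a Java UDF; the
    -- surrounding batch flow is expressed in Pig.
    REGISTER legacyudfs.jar;  -- jar holding the rewritten Java logic
    DEFINE PriceRule com.example.udf.PriceRule();

    orders = LOAD '/data/raw/orders' USING PigStorage(',')
             AS (order_id:long, sku:chararray, qty:int, list_price:double);

    -- The UDF carries the business rules too complex to express in Pig itself.
    priced = FOREACH orders GENERATE order_id, sku, qty,
                 PriceRule(sku, qty, list_price) AS net_price;

    STORE priced INTO '/data/derived/priced_orders' USING PigStorage(',');

On the Java side, a class like PriceRule would extend org.apache.pig.EvalFunc and implement its exec() method with the translated logic.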


Another area where Hadoop excels is the ETL space. Using Hadoop as an enterprise data hub and keeping data at its most atomic level, direct from the source system, means that your base data is always in Hadoop. This base data can be used for any transformation, and it keeps the data lineage clear (as opposed to transforming data that has already been transformed once or several times). Transformations can be coded in Pig and run when needed, and because of Hadoop's parallel processing power, these transformation jobs complete much faster than on a traditional ETL platform. Data warehouse analytics that involve large quantities of data, and do not require high-speed SQL access with sub-second response times, also run extremely well on Hadoop. This strategy frees up computing resources (and perhaps disk space) on the more expensive data warehouse platform by offloading the kinds of workloads that run just as easily on Hadoop.

Massively Parallel Processing

Parallelization is a key benefit of Hadoop. Because of the parallel horsepower in a Hadoop cluster, your application has many computers working on the problem simultaneously. Processing times are therefore dramatically reduced compared to sequentially processed, single-threaded environments, and processes coming from mainframe, ETL and distributed environments are usually single threaded.

Reliability of Hadoop

Hadoop is a highly available, fault-tolerant environment. With many machines serving as data nodes, local disk on each, and three replicas of every data block, there are many options for where to process the data. In a Hadoop cluster the data nodes are designed to fail (one power supply and one NIC rather than two of each): the framework is set up so that when a data node fails, another picks up for it, the end user or batch process never even knows the node failed, and processing continues on another replica of the data elsewhere in the cluster.

There are some single points of failure in the Hadoop ecosystem. These include the various master nodes, such as the name node, secondary name node, job tracker, Oozie and HBase Master. If one of these machines fails, the users of Hadoop overall, or of that component, will experience a disruption of service. Because of this, the master machines are built with redundant equipment (dual power supplies and dual NICs) to make them more fault tolerant.

Benefits to Implementing Hadoop in the Enterprise

There are many direct benefits to using Hadoop.

Cost Effectiveness. The first benefit is generally lower cost than traditional computing solutions. Open-source software and low-cost commodity hardware drive the cost of a Hadoop installation down.

Faster Processing. Parallel processing across multiple data nodes drives processing times down versus traditional solutions.

Ease of Scalability. With Hadoop, you can add more servers to the cluster without an outage: add them in, update the cluster configuration files, and Hadoop starts using them immediately. A rebalance option redistributes existing data across the nodes and should be run periodically as part of your operational procedures.

Fault Tolerance. Built-in fault tolerance, data nodes designed to fail, three replicas of every data block, and redundant hardware on the single-point-of-failure master servers all contribute to making a Hadoop environment fault tolerant, reliable and continuously available.

Single Version of Truth. Combining operational processing on the same platform as your enterprise data hub gives your organization the opportunity to create its single version of the truth on Hadoop. Once data is in the Enterprise Data Hub, it can be consumed by others, including downstream operational processes; and if you move operational processes to Hadoop, the data brought in for those processes can in turn be used to build and populate the Enterprise Data Hub.

There are a few indirect benefits to implementing Hadoop as well. Having a sizable Hadoop cluster enables your organization to keep data you never had room for previously, and gives it a place to put data that you struggle to find a home for: old transaction data, old web traffic data, or data you believe has value but have no specific need for right now.


As mentioned earlier, a Hadoop cluster has a good deal of processing horsepower. Perhaps there have been problems your organization simply did not have the computing power to tackle in the past; with a Hadoop cluster and its available computing power, your organization can explore those possibilities now.

Impedances, Barriers and Road Blocks to Implementing Hadoop

If your company is like most, there are usually a number of people who can say no, but only a select few who can autonomously say yes. In many cases there is no single person who can say yes, so your job is to convince all of the people who can say no to say yes instead.

Response to Resistance

Resistance issue: Open source software is not enterprise ready.
Response: Hadoop has a robust Apache committee behind it, which controls its releases and software versions. There are also a number of Value Added Hadoop Issuers (VAHIs), including Cloudera, Hortonworks, Intel, IBM, MapR and Pivotal. Some VAHIs offer open-source instances of Hadoop that you can download and run for free; all of them offer Hadoop instances fully supported by their organizations for a fee.

Resistance issue: Hadoop is not secure.
Response: Presently, Hadoop does not have a strong security model. To secure the environment to enterprise standards, MetaScale has designed a model that firewalls off direct access to Hadoop, which eliminates unauthorized access and requires authorized users to proxy through a secure layer. Additionally, the POSIX-style Linux file system permission model is used to restrict and grant access at the file system level, so only authorized users can reach the data.

Resistance issue: We need to be able to turn to someone for support.
Response: A select few companies provide Hadoop support through fee-based, structured support models. MetaScale is one such company; it can provide a running, ready-to-use Hadoop cluster and talented experts to help jump-start and sustain your Hadoop projects and applications.

Resistance issue: Hadoop is not suitable for mission- and time-critical production work.
Response: With the fault tolerance built into the Hadoop infrastructure and ecosystem, and the stability of the software, Hadoop is well suited to mission-critical workloads. As mentioned earlier, a sizable Hadoop cluster processes work much faster than traditional computing platforms.

Resistance issue: Hadoop cannot do everything an RDBMS does.
Response: This is true. But it can do some of the things an RDBMS does, at a dramatically lower cost. Targeting the kind of work that runs well on Hadoop relieves the workload on other, more expensive environments, freeing up capacity for the processes that genuinely need them.

Resistance issue: Hadoop does not have management tools like other platforms (e.g., the mainframe).
Response: There are management tools for Hadoop. Some are open source, and some come with a purchased distribution from the major vendors. Some tools are specific to Hadoop, while others are more generic but work with Hadoop environments.

Resistance issue: Backup, recovery and disaster recovery are not very mature in Hadoop.
Response: Hadoop ships with a utility that facilitates backing data up via a copy. In one MetaScale installation of Hadoop, the daily deltas are copied over to a backup cluster; if a specific application dictates that its data be backed up more frequently, that can be accommodated as well. From the backup cluster, data is copied to a disaster recovery cluster in a different data center.

Resistance issue: We do not have the skills, or Hadoop is not our core competency.
Response: Although the learning curve is steep, this is something an organization can adapt to. If you already have data warehouse users, Hadoop is a slight departure, but many concepts translate. With regard to technical skills, a base of Linux administration and Linux-savvy resources is a great place to start. A number of companies provide Hadoop training, including Cloudera, Hortonworks and MetaScale. MetaScale can provide a complete, soup-to-nuts Hadoop solution, or any variation or component thereof, and offers a comprehensive Hadoop training curriculum to train your key resources for ongoing success.

Risk Management

When it comes to IT risk management, and specifically the risk of system failures, organizations usually protect themselves with large, expensive contracts with big players in the IT industry; hardware and software are purchased with support and maintenance. There are four ways to respond to a risk: accept the risk event if it happens; take actions to avoid the risk event; mitigate the risk; or deflect and transfer the risk (for example, through insurance or contracts with big IT players).

Risk Acceptance

This risk response strategy does not require much effort: wait for the risk event (a system failure) to happen, and then deal with the fallout.


For the most part, system failures are going to happen; when they occur, we deal with them and resolve them as fast as possible.

Risk Avoidance

This risk response strategy involves taking action to prevent the risk event from happening, such as building redundancy into system hardware components or coding self-healing subroutines into programs wherever possible. The master nodes in a Hadoop cluster are true single points of failure; as a result, a great deal of component redundancy is built into those servers to avoid machine failure.

Risk Mitigation

Risk mitigation is taking action to reduce the probability that a risk event will happen, or to reduce the impact if it does. Because of the fault tolerance built into Hadoop, much of this is already handled.

Risk Transfer

This risk response strategy involves transferring the risk to other organizations by means of contracts, making those third parties responsible for dealing with failures. Hardware failure can be addressed by purchasing expensive hardware support; for the master nodes in a Hadoop cluster this is a good idea, but the data nodes are already redundant and fault tolerant, so there is no need. Hadoop software support can be purchased from companies such as Cloudera and Hortonworks. For Hadoop implementation, your organization can transfer the risk to MetaScale, which provides a fully managed Hadoop environment.

>> NEXT STEP: The Plan (see companion piece)

Contact MetaScale for More Information

Toll-free: 1-800-234-8769 | Email: [email protected]


About the Author

Andrew McNalis is a Hadoop Infrastructure Manager at Sears Holdings Corporation. Andy has been a leading member of the team that builds, deploys and manages an enterprise-scale Hadoop platform at Sears. Part of MetaScale's Hadoop Center of Excellence, Andy is involved in the development of design best practices for Hadoop.

MetaScale provides technology, talent and solutions to help enterprises accelerate their big data efforts and generate value from their data. Part of the Sears Holdings family of companies, MetaScale offers end-to-end services for Hadoop and big data. We leverage our Fortune 100 heritage and experience of applying Hadoop technology to the enterprise to give our customers practical big data solutions.

Visit us: www.metascale.com

© 2013 Sears Brands LLC. All Rights Reserved