Hadoop Data Management Platforms: Market Segmentation and Product Positioning
BI Leadership BI Technology Guide
By Colin White, July 2013


Background

THIS REPORT PROVIDES AN INTRODUCTION TO HADOOP and discusses the current state of the Hadoop data management platform market. It suggests criteria for selecting a Hadoop data management platform and reviews leading Hadoop solutions from a variety of vendors.

WHAT IS HADOOP? An open source software library and framework from the Apache Software Foundation (ASF) that supports the distributed processing of large data files across clusters of computers.

WHAT IS OPEN SOURCE SOFTWARE? Licensed software source code where the copyright holder gives the rights to study, change and distribute the software for free to anyone and for any purpose. Note that an open source software distribution can be used for commercial purposes provided the terms of the original license are not violated.

WHAT IS A HADOOP DATA MANAGEMENT PLATFORM? The term used in this document to describe a collection of integrated components that enable, and possibly enhance and extend, a Hadoop distributed processing environment. Such a platform may consist solely of software or may be a combination of both software and hardware. The software may be 100% open source or may have both open source and commercial components.

Components of Importance to Hadoop Data Management

There are many open source components associated with Hadoop and related ASF projects. Summarized below are the main components of importance to a Hadoop data management platform.

Apache Hadoop Distributed File System (HDFS): A distributed file system for storing and processing large datasets across a cluster of machines. HDFS is designed for a write-once and read-many style of processing; it is not intended for random read-write operations.

Apache Hadoop MapReduce: A programming model for the processing of large data files in parallel.

Apache Hadoop YARN: A resource-management and distributed application processing framework that enhances MapReduce (sometimes referred to as MapReduce 2.0)¹ and extends Hadoop to support programming models and application workloads other than MapReduce.

Apache Hive: A tool that provides a SQL-like declarative language (HiveQL) for processing data in Hadoop-compatible file systems. Hive compiles HiveQL into MapReduce jobs before they are run on the Hadoop system.

Apache Pig: A tool that provides a high-level procedural data analysis language (Pig Latin) for processing data in Hadoop-compatible file systems. Pig compiles Pig Latin applications into MapReduce jobs before they are run on the Hadoop system.

Apache Mahout: A data mining and machine learning library. The algorithms in this library are primarily developed using the MapReduce programming model.

Apache HBase: A nonrelational² distributed database management system (DBMS) for Hadoop, modeled after Google's BigTable project. The DBMS capabilities provided by HBase are more suited to random read-write workloads compared with HDFS, which is designed primarily for batch sequential processing.³

Apache Sqoop: A tool for the bulk transfer of data between Hadoop and external systems (such as a relational DBMS, for example).

Apache HCatalog: An Apache incubator project⁴ that extends Hive to provide a common metadata abstraction layer for data created by Hadoop components such as MapReduce, Hive and Pig.

Apache Ambari: An Apache incubator project that provides a web-based tool for provisioning, managing and monitoring Hadoop clusters.

Apache Lucene: A full-text search engine library written in Java.

Apache Solr: A web search platform (servlet container) built around Lucene.

Other Apache open source components often incorporated into a Hadoop data management platform include: Avro (client/server serialization protocol for exchanging and persisting data), Chukwa (for monitoring large distributed systems), Flume (web log data collection and aggregation), Oozie (workflow orchestration and scheduling) and ZooKeeper (coordination service for distributed applications).

1. At the time of writing, Apache YARN was available as an alpha preview release.

2. The terms NoSQL or NewSQL are also often used, but nonrelational is a more accurate term.

3. There are many other nonrelational systems available for Hadoop, including column, document and graph stores. These are not typically available as a component of a Hadoop data management platform, but as add-on products. They are not discussed further in this paper. Visit nosql-database.org for a list of examples.

4. Code donated by external organizations to Apache is introduced initially to the open source community through an incubator project.

Market Evolution

A Little History

The seeds of Hadoop have their origin in the Nutch open source search engine developed by Doug Cutting and Mike Cafarella. This project began in 2002 and evolved to incorporate several ideas developed by Google, including the Google File System (on which HDFS is based) and MapReduce. After Doug Cutting went to work at Yahoo in 2006, the basic underlying infrastructure of Nutch became the Hadoop open source project (named after the stuffed elephant belonging to Cutting's son) at the Apache Software Foundation.

Yahoo steadily improved the scalability of Hadoop, and by 2008 it was used not only to build the Yahoo web index, but also to run many of its batch analytic processes. The year 2008 also saw former engineers from Google, Facebook and Yahoo form Cloudera, a company focusing on Hadoop software solutions. Doug Cutting joined Cloudera in 2009. Over the next few years the use of Hadoop steadily grew, especially in companies needing to process large volumes of web data. This growth led Yahoo in 2011 to form a separate company, Hortonworks, which focused on Hadoop-related software.

Hadoop is changing rapidly, and as the functionality of Hadoop improves, so too does its use in mainstream enterprises. IT groups, for example, are using Hadoop as a cost-effective data refinery for collecting, managing and filtering increasing volumes of data for use by downstream enterprise applications. IT is also using it for data archiving. Business units, on the other hand, are employing Hadoop for standalone analytical applications that process and analyze large volumes of data, especially multi-structured data such as web logs, system logs, social media data, email, and network and sensor data.

Hadoop Architecture

Apache Hadoop supports a distributed computing framework that is designed to handle the processing of large volumes of multi-structured data.⁵ This data is distributed across a hardware cluster consisting of dozens, potentially hundreds, of low-cost hardware servers and processed in parallel by replicating application processing on multiple nodes of the cluster. This parallel processing approach is especially beneficial for workloads that need to sequentially process a small number of very large data files, such as a data transformation or a data mining workload.

In addition to the core services that support the overall Hadoop environment, the Apache Hadoop library includes two key components for enabling distributed processing: the Hadoop Distributed File System (HDFS) and the MapReduce (MR) programming model. HDFS is used to store and manage data, while the MR programming model isolates the developer from needing to know how to write distributed applications for processing that data. Both HDFS and MapReduce are designed so that machine node and storage device failures are handled automatically by the software framework. HDFS, for example, replicates data to increase data availability.

MR programs are typically written in Java, but other programming languages can also be used. Higher-level programming tools are available for developing Hadoop applications. Two popular ones are Apache Hive and Apache Pig. The R language is also used.

5. This paper uses the term multi-structured data rather than unstructured data since most data has some form of structure.
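To make the MR programming model described above more concrete, the following is a minimal single-process sketch in Python of the map, shuffle and reduce steps for a word count. It is an illustration only and does not use the Hadoop API: the names `map_words`, `reduce_counts` and `run_job` are hypothetical, and the "splits" are plain in-memory lists standing in for blocks of a large file that a real cluster would process on separate nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum every count that was emitted for one word."""
    return word, sum(counts)


def run_job(
    splits: List[List[str]],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run mapper over every split, group values by key, then reduce.

    Assumption: on a real cluster each split would be handled by a
    different node; here the splits are simply processed one by one.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for split in splits:
        for line in split:
            for key, value in mapper(line):
                grouped[key].append(value)  # the "shuffle": group by key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample_splits = [["web log web"], ["log data"]]
    print(run_job(sample_splits, map_words, reduce_counts))
```

The developer writes only the mapper and the reducer; distributing the splits, grouping the intermediate pairs and recovering from node failures are the parts the framework is described as handling automatically.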

Common Characteristics and Trends

Increasing use and interest in Hadoop has led to significant development efforts by commercial vendors to enhance and extend the Apache Hadoop framework, and offer a range of Hadoop data management platforms. Although platforms vary by vendor, there are some common characteristics and trends that apply to all platforms:

1. CORE COMPONENTS BASED ON OPEN SOURCE SOFTWARE. All platforms include Apache Hadoop core services, HDFS and MapReduce, plus Apache Hive and Apache Pig. Many also include the HBase DBMS.

2. MASSIVELY PARALLEL PROCESSING ON LOW-COST HARDWARE. A key benefit of Hadoop is its ability to support massively parallel processing (MPP) on low-cost hardware. Several fault-tolerant features are built into the software to compensate for the likely higher failure rates of this hardware. Hadoop can also be deployed on a desktop computer or SMP server for evaluation and development purposes. The nodes of a Hadoop MPP cluster generally use standalone or rack servers with direct-attached storage (DAS). This shared-nothing architecture is easy to scale out to satisfy growth needs. It also provides data locality. Other hardware configurations are possible, including the use of blade servers and shared storage architectures (such as a storage area network, for example). These latter options may be more difficult to scale out and may also increase the cost of the system.

3. MULTIPLE DEPLOYMENT PLATFORMS. Hadoop products run on Linux, but support for Microsoft Windows is available. Some vendors offer optimized versions of Hadoop for running as a virtual machine. The ability to operate Hadoop in a cloud-computing environment is a growing trend. There is also increasing interest in deploying Hadoop in the open source OpenStack cloud operating environment.

4. MULTIPLE APPLICATION INTERFACES. The direction of Hadoop vendors is to provide three types of application interfaces to Hadoop data: procedural programming language (Java, other MR-supported languages, Pig, R), declarative query language (Hive and other SQL layers) and search (Apache Lucene/Solr and other proprietary search engines). One area that is receiving significant attention and development effort is the addition of an SQL layer on top of the Hadoop environment. This has several benefits. One of the key ones is that it extends Hadoop's programmatic MR model to support SQL-based tools, which in turn enables non-programmers, such as business analysts, to access Hadoop data. Depending on how the SQL layer is implemented, it can also extend Hadoop's batch-oriented workload model to enable a more ad hoc style of processing. There are, however, some important implications and issues associated with using interactive SQL interfaces with Hadoop. These are discussed in detail in the “SQL on Hadoop” section of this report.

5. DATA INTERCHANGE WITH OTHER ENTERPRISE SYSTEMS. For Hadoop to be successful in mainstream enterprises, it must be able to exchange data with existing IT systems. The Apache Sqoop project provides bulk data transfer between Hadoop and other data systems, but its capabilities are somewhat limited. This is why many vendors have developed their own data transfer solutions. These solutions, however, vary significantly both in function and in performance. Several relational DBMS vendors (IBM, Oracle, Teradata, HP Vertica, Actian ParAccel, for example) provide enhanced bulk data transfer features that exploit the parallel computing capabilities of both source and target systems to improve data transfer performance. These transfer operations are often done using Hadoop MR programs that are invoked using SQL function calls in relational DBMS applications and scripts.


Data integration vendors (Informatica, Actian Pervasive, Pentaho, Syncsort, Talend and SAS, for example) extend bulk data transfer with the ability to transform the data before it is transferred. This transformation is frequently achieved using MR programs.

6. EVOLVING DATA MANAGEMENT TOOLS. Hadoop to date has offered little in the way of the data management tools that are a key feature of most enterprise systems. Examples of requirements here include data loading, backup and recovery, disaster recovery, security, data transformation and integration, data quality management, metadata management, administration and tuning. Some initial steps are now being taken to satisfy these requirements. One of these is the Apache HCatalog facility, which provides a common interface to Hive, Pig and HDFS metadata. Several vendors are also beginning to offer their own tools for data management, and these tools will become increasingly important as the use of Hadoop continues to grow.

7. REMOVAL OF HADOOP SINGLE POINTS OF FAILURE (SPOF) AND PERFORMANCE BOTTLENECKS. There are several Hadoop SPOFs (Hadoop NameNode and JobTracker service) and performance limitations (scheduling and workload management, query performance of Hive and HDFS), and both Apache and Hadoop vendors are working to overcome these issues. Many of the vendor efforts in this area are documented in the “Vendor Assessments” section of this document.

8. EVOLVING SYSTEM MANAGEMENT TOOLS. Many companies find Hadoop difficult to install, administer, tune and maintain. This is not only due to a general lack of tooling, but also because these companies don't have the required skills. Vendor Hadoop data management platforms help simplify matters, especially in the area of installation, and most Hadoop vendors also offer consulting and support services. There is, however, still a lack of enterprise quality system management tools for Hadoop, and several open source and vendor projects are in the works to help solve this. An example of such a project is Apache Ambari. As the use of Hadoop in enterprises grows and competition increases, systems management will become a key vendor differentiator. One of the driving forces behind Hadoop, however, is to reduce the cost of IT systems and data management, and the investigative nature of many Hadoop applications means that sophisticated systems management tools are not always required.


9. INCREASING NUMBER OF PARTNER RELATIONSHIPS. Hadoop has become closely associated with the present industry focus on the concept of big data. The key to realizing investments in big data technologies is how to leverage this data for business advantage. Good development tools coupled with easy-to-use business tools and packaged applications will play an important role here. This is why many Hadoop vendors are partnering with tools developers and third-party vendors to build out a repertoire of IT and business user capabilities. Of course, the level to which this can be achieved will be dependent on the ability of the underlying Hadoop data management platform to support the functions required with good performance.

SQL on Hadoop

As discussed earlier, the addition of an SQL layer on Hadoop offers several benefits. The degree to which these benefits can be realized is dependent on how this layer is implemented. This section takes a detailed look at this important and somewhat controversial topic.

A better understanding of the pros and cons of the various SQL layers available for Hadoop can best be gained by first examining the role of SQL in data management. SQL was one of the first commercial languages to support Dr. E. F. Codd's relational model. SQL coupled with relational tables and views gave developers for the first time a standard data access language and a logical view of data that is independent of the way data is physically stored, accessed and managed. This not only improves usability, but also enhances portability and interoperability.

One of the key components of a relational system is the query optimizer. The optimizer's job is to provide physical data independence and to determine the most appropriate way to physically access and process data. The optimizer plays a major role in the performance of the system, and significant amounts of research and development have gone into designing efficient optimizer technology and integrating this technology at run time with workload management. Optimizer quality and extensibility and sophisticated workload management are essential if a product is to process large amounts of data and complex workloads with good performance.

One of the first SQL layers developed for Hadoop was Hive, which presents data to developers in the form of tables. The data in these tables is accessed and manipulated using HiveQL query statements. Hive does not have a sophisticated query optimizer, but instead uses a query compiler and a set of rules to convert HiveQL queries into a series of MR batch jobs for execution on a Hadoop cluster. The compiler has limited knowledge about the physical location of data in HDFS files, and hints are used in HiveQL queries to aid the performance of table join operations and so forth.

HiveQL supports a subset of the SQL-92 standard. An INSERT statement is provided, but it can only be used to load or replace a complete table or table partition. The equivalents of the SQL UPDATE and DELETE statements are not supported. Hive also offers additional capabilities over and above the SQL standard. The MAP and REDUCE operators, for example, can be used to embed custom MR scripts in HiveQL queries.

The primary use case for Hive is the same as that for MR applications, which is the sequential processing of very large data files such as web logs. It is not well suited to ad hoc queries where the user expects fast response times. The main benefit of Hive is that it dramatically improves the simplicity and speed of MR development. The Hive compiler also makes it easier to process interrelated files compared with hand-coding MR procedural logic to do this.

To overcome the limitations of Hive and to enable faster query performance, several vendors are building enhanced SQL layers on top of Hadoop. These SQL layers employ a variety of techniques. Some of the main ones are outlined below:

1. Improve the functionality and performance of Hive (Hortonworks and Intel).

2. Add an SQL layer that bypasses Hive and MapReduce and accesses the Hadoop data directly. The downside of this approach is that the developer loses the power of MR processing. For this reason, this approach complements Hive and MapReduce, rather than replaces them (Apache and Cloudera).

3. Develop new on-disk and/or in-memory Hadoop data handlers and data formats that are more suited to ad hoc query processing (Apache, Cloudera, Hortonworks, IBM, Intel and Pivotal).

4. Build a new SQL query engine running on Hadoop that uses a query splitter to route SQL query fragments to one or more underlying data handlers (HDFS, HBase, relational, search index, etc.) to access and process the data (Hadapt and IBM).
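As a reference point for the techniques listed above, the baseline behavior they try to improve on, a compiler turning a declarative query into map and reduce work, can be sketched in a few lines of Python. This is not Hive's compiler and uses no Hive or Hadoop API: `compile_group_by` and `execute` are hypothetical names, the only "query" supported is the equivalent of SELECT group_col, SUM(sum_col) ... GROUP BY group_col, and the rows are in-memory dictionaries rather than HDFS files.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def compile_group_by(
    group_col: str, sum_col: str
) -> Tuple[Callable[[Row], Tuple[object, float]], Callable[[object, List[float]], Row]]:
    """Hypothetical rule-based 'compiler' for a single GROUP BY / SUM query.

    Returns a (mapper, reducer) pair, mirroring the idea that a declarative
    statement is converted into procedural map and reduce steps.
    """

    def mapper(row: Row) -> Tuple[object, float]:
        # Emit the grouping key together with the value to aggregate.
        return row[group_col], row[sum_col]

    def reducer(key: object, values: List[float]) -> Row:
        # Aggregate every value that was shuffled to this key.
        return {group_col: key, "total": sum(values)}

    return mapper, reducer


def execute(rows: List[Row], mapper, reducer) -> List[Row]:
    """Run the compiled steps as one batch pass over all rows."""
    shuffled = defaultdict(list)
    for row in rows:
        key, value = mapper(row)
        shuffled[key].append(value)
    return [reducer(key, values) for key, values in sorted(shuffled.items())]


if __name__ == "__main__":
    web_log = [
        {"page": "/home", "bytes": 100},
        {"page": "/cart", "bytes": 40},
        {"page": "/home", "bytes": 60},
    ]
    mapper, reducer = compile_group_by("page", "bytes")
    print(execute(web_log, mapper, reducer))
```

Note that `execute` always scans every row before returning anything, which is a small-scale picture of why a batch-compiled query suits sequential processing of large files better than ad hoc lookups.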

Table 1 shows examples of leading SQL development projects for Hadoop, the techniques used to create the SQL layer, and the types of Hadoop data handlers and data formats supported. These projects are reviewed in more detail in the “Vendor Assessments” section of this report. These SQL layers are in various stages of development and may only be available for testing purposes in some cases. Also, it must be realized that it has taken RDBMS vendors many years of research and development, and significant development resources, to build high-performance SQL optimizers and data handlers. It is unreasonable, therefore, to expect the relatively immature SQL layers in Hadoop data management platforms to provide the same level of functionality and performance, especially where complex SQL workloads are required.

Table 1: Projects that add an SQL layer on Hadoop

SQL Layer Provider | Project Name | Techniques Used | Data Handlers and Data Formats Supported
Apache | Hive | - | MR jobs to HDFS
Cloudera | Impala/Parquet | 2, 3 | HDFS, HBase, Parquet
Hortonworks | Stinger | 1, 3 | Hive, ORCFile
Hadapt | Interactive Query | 4 | HDFS, relational data
IBM | BigSQL | 4, 3 | Hive, enhanced HBase, relational data, …
MapR | Drill (based on Google Dremel) | 2, 3 | HDFS, HBase, MapR-FS, …
Intel | Panthera | 1, 3 | Hive, enhanced HBase
Pivotal | Advanced Database Services (HAWQ) | 2, 3 | Pivotal HD to HDFS, HBase, relational HDFS data

Customer Use Cases

Although many companies are still trying to figure out the role of Hadoop in the enterprise, the market for Hadoop is nevertheless growing rapidly and some clear customer use cases are now apparent. These use cases fall into two main types: those supported by IT and those driven by individual business units.

IT groups in many organizations are struggling to cost-effectively manage increasing amounts of data, and one of the main use cases for Hadoop in IT is to use it to store and manage, and also potentially transform, large volumes of data at a lower cost than with existing systems. This use case has various names, including data landing area, data hub and data refinery. Regardless, cost savings are achieved by providing the ability to:


• STORE MORE DETAILED OPERATIONAL DATA IN HADOOP FOR A LONGER PERIOD OF TIME. Cost limits the amount of detailed data that can be kept in a data warehouse, and so in this scenario Hadoop is used to store all the detailed data required for analysis. Then subsets of this data are staged into the data warehouse and other business intelligence (BI) systems as required. Retail companies, for example, are now storing many years' worth of detailed customer data in Hadoop and copying just the more recent data into the data warehouse for analysis.

• USE HADOOP AS A DATA ARCHIVE. Data that is no longer required in a data warehouse can be archived in Hadoop. This is useful to satisfy statutory data retention requirements and also makes the archive available for query purposes if the need should arise.

• USE HADOOP AS A DATA STORE FOR MANAGING AND TRANSFORMING NEW SOURCES OF DATA. Many companies are interested in extending their systems to handle new types of data. This data may range from web and social computing data to network logs and data from sensor networks. Hadoop can be used as a cost-effective ELT engine to extract, load and transform this data and then move the results into a data warehouse.

From a business unit perspective, the main interest in using Hadoop is not only for managing and transforming data, but also for analyzing it. Use cases here are usually focused on specific line-of-business (LOB) needs and on analyzing new types of data from a limited number of high-volume data sources: Facebook, Twitter, web logs, network logs, call detail records, claim data, call center logs, external news feeds and so forth. Many of these applications are customer facing and investigative in nature. The analytical processing may be used to identify additional data for use in a data warehouse, to expand existing business analytics or to improve a predictive model. It may also result in a completely new analytics-driven business application.

There are a wide variety of other use cases, but the dominant ones that surfaced in interviews for this report are those outlined above: reducing IT costs by using Hadoop as a data refinery or archive and using Hadoop as an investigative platform for analyzing specific sources of data that have large data volumes.


Types of Hadoop Data Management Platform

So far this document has defined Hadoop data management platforms, described their common characteristics, explained how the market is evolving and reviewed some key customer use cases. This section helps you differentiate among Hadoop data management platforms so you can create a shortlist of vendors that make sense for your organization.

Commercial vendors take the Hadoop framework and library from Apache, enhance it and make it available in one or more of the following forms:

• Open source Hadoop software distribution that contains the main components required to deploy a Hadoop environment. The open source components in the distribution are tested to ensure they work together as a single integrated system.

• Enhanced Hadoop software platform that includes the components of a Hadoop software distribution plus additional open source, and possibly commercial, components that provide enhanced data and system management capabilities. Some of these platforms can be configured for use in a virtual machine environment.

• Integrated hardware and software Hadoop appliance that provides a single system containing a Hadoop software platform integrated with a hardware reference platform and optimized for a Hadoop processing environment. In some cases the software supplier may also manufacture the hardware, while in other situations the hardware may come from one or more third-party providers.

• Hadoop cloud services offering that provides Hadoop services for use in a public and/or private cloud operating environment.

• Hadoop DBMS or analytic server software that runs on each node of a Hadoop hardware cluster and exploits Hadoop's parallel processing capabilities to provide high performance. This software may support one or more Hadoop software platforms. The software is usually packaged and delivered with the Hadoop software platform, but this is not true in all cases.


Companies such as Cloudera, Hortonworks and MapR offer both Hadoop software distributions and Hadoop software platforms. These companies generate their revenue through education and consulting and support services, and also, in some cases, by charging license fees for additional components supplied with a Hadoop software platform. In general, the software distributions from these companies are more appropriate for enterprise purposes, compared with integrating the various Hadoop components available directly from Apache.

Many enterprise hardware vendors (EMC Pivotal, HP, IBM, Intel, Oracle and Teradata) market Hadoop software platforms and/or Hadoop appliances, while several enterprise software vendors (Microsoft) provide Hadoop software platforms that can be deployed on multiple hardware platforms. Other enterprise software vendors offer relational DBMS (Hadapt) or analytic server (SAS) software that runs natively on a Hadoop system. Many of these vendors work with Hadoop software distribution vendors to provide the underlying Hadoop open source components. Some products support only a single Hadoop distribution, while others allow a choice of distribution.

Table 2 lists some of the main vendors and the types of products they provide. More detailed information about these products can be found in the “Vendor Assessments” section of this report.

Several cloud computing companies (Amazon and Rackspace, for example) offer cloud-computing services based on Hadoop. Note also that some Hadoop software platforms can run as a virtual machine, which enables them to be used in a cloud environment. Table 3 shows examples of vendors offering cloud services and virtual machine solutions, but this list is not exhaustive.


Table 2: Vendors with Hadoop software-only solutions and appliances

Vendor | Product | Type
Cloudera | CDH | SW distribution
Cloudera | Cloudera Standard and Cloudera Enterprise | SW platforms
Hadapt | Hadapt Adaptive Analytical Platform | DBMS server
Hortonworks | Hortonworks Data Platform | SW platform
Hortonworks | Hortonworks Data Platform for Windows | SW platform
HP | HP Reference Architecture for Cloudera, Hortonworks and MapR | SW platforms
HP | HP AppSystem for Apache Hadoop | Appliance
IBM | IBM InfoSphere BigInsights Basic, Quick Start and Enterprise editions | SW platforms
IBM | IBM PureData System for Hadoop | Appliance
Intel | Intel Distribution for Apache Hadoop Software | SW platform
MapR | MapR M3, M5 and M7 editions | SW platforms
Microsoft | Microsoft HDInsight Server for Windows | SW platform
Oracle | Oracle Big Data Appliance | Appliance
Pivotal (EMC) | Pivotal HD Community & Enterprise editions | SW platforms
Pivotal (EMC) | Pivotal Data Computing Appliance | Appliance
SAS | SAS LASR Analytic Server | Analytic server
Teradata | Teradata Aster Big Analytics Appliance | Appliance
Teradata | Teradata Appliance for Hadoop | Appliance
Teradata | Teradata Commodity Configuration for Hadoop | HW & SW platform
Teradata | Teradata Software-Only for Hadoop | SW platform


Table 3: Examples of vendors with Hadoop cloud services and virtual machine solutions

Vendor | Product | Type
Amazon | Amazon Web Services Elastic MapReduce | Public cloud
Hortonworks | Hortonworks Sandbox | Virtualization
Microsoft | Windows Azure HDInsight Service | Public cloud
Microsoft | HDInsight Server for Windows | Virtualization
Pivotal (EMC) | Pivotal HD with Hadoop Virtual Extensions and Project Serengeti | Virtualization
Rackspace | Rackspace Private Cloud (powered by OpenStack) | Private cloud


Evaluation Criteria

BEFORE CHOOSING A HADOOP DATA MANAGEMENT PLATFORM, there are many criteria you need to consider. The list below serves as a starting point for a more thorough evaluation:

1. HADOOP PLATFORM. What types of Hadoop platforms does the vendor supply? Which Hadoop distribution is used by each platform? What Hadoop components are included? Which operating systems are supported? Can the platform operate in a virtual machine environment? Can the platform operate in a private or public cloud? Which virtual machine and cloud-computing environments are supported? What is the pricing model and cost of the platform? Is a free evaluation version available?

2. HARDWARE. Is the Hadoop platform packaged as a hardware and software appliance, or can buyers select their own hardware? Does the vendor have recommended hardware reference architectures for the platform? What hardware suppliers and systems are supported? Does the vendor integrate and test the hardware and software at the factory before delivery?

3. OPEN OR PROPRIETARY. Are all the software components of the Hadoop platform open source? Does an open source community support the platform? Are there any software components in the platform that would cause vendor lock-in?

4. SERVICES. What training, consulting and support services does the vendor offer for the Hadoop platform? What is the cost of these services? Does the vendor or a third-party company provide these services? Are customer references available for these services?

5. APPLICATION DEVELOPMENT INTERFACES AND TOOLS. What application interfaces and programming languages are supported by the Hadoop platform? Are application development tools provided to aid developer productivity? Can third-party developers use the interfaces to extend the platform with additional application development and end-user capabilities and tools? Are development kits available that aid developers in extending the platform? Is there a certification program for these third-party extensions?

6. SQL CAPABILITIES AND PERFORMANCE. Does the Hadoop platform provide or support an SQL layer? What techniques were used to build this layer (see Table 1)? What data handlers and data formats does the SQL layer support? Does the SQL layer adhere to the Entry Level SQL-92 Standard? If not, which SQL capabilities are not supported? What SQL capabilities are supported in addition to SQL-92? Does the SQL layer employ a rules-based or statistics-driven optimizer? Does the optimizer generate Hive, Pig or MR code? Does the optimizer rewrite queries to improve optimization? Can the developer add SQL hints to aid optimization? Are there performance benchmarks available that document the performance of the SQL layer?

7. DATA MANAGEMENT TOOLS. What data management tools does the Hadoop platform provide? What capabilities are provided for data loading, backup and recovery, disaster recovery, data security, data transformation and integration, data quality management, metadata management, administration and tuning?

8. SYSTEM AND WORKLOAD MANAGEMENT TOOLS. What systems and workload management tools does the Hadoop platform provide? What capabilities are provided for installation, scheduling and workload management, system and disaster recovery, monitoring and tuning, system security, administration and system maintenance?

9. EXTERNAL DATA SYSTEM INTERFACES AND ADAPTERS. What interfaces and adapters does the Hadoop platform provide for exchanging data and interoperating with external data systems? Do these adapters exploit the parallel processing capabilities of Hadoop and the external data systems? Is benchmark data available showing the performance of these interfaces and adapters? Can third-party developers add their own adapters for unsupported data systems? Is there a certification program for these third-party adapters?

10. THIRD-PARTY TOOLS AND APPLICATIONS. Which third-party data integration (Informatica and Talend, for example) and business user analysis tools and applications are supported (Datameer, Karmasphere, Pentaho, SAS and Tableau, for example)? Does the vendor support a marketplace of third-party add-on products?

11. MATURITY. How many years has the product been deployed? How many active customers and partners does the product have? Are customer references available for the platform?
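One way to turn the criteria above into a comparison is a weighted scoring matrix. The sketch below is only an illustration: the criterion keys, the weights, the 0 to 5 scoring scale and the two candidate platforms are all hypothetical and are not taken from this report. Each organization would choose its own weights and could extend the dictionary to cover all eleven criteria.

```python
# Hypothetical weights for a subset of the evaluation criteria.
# The keys and values are assumptions chosen for illustration only.
WEIGHTS = {
    "hadoop_platform": 0.25,
    "sql_capabilities": 0.25,
    "services": 0.20,
    "open_or_proprietary": 0.15,
    "maturity": 0.15,
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted average of per-criterion scores (each 0 to 5)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank_platforms(candidates, weights=WEIGHTS):
    """Sort candidate platforms from highest to lowest weighted score."""
    scored = [(name, weighted_score(s, weights)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores for two fictional candidates.
candidates = {
    "Platform A": {"hadoop_platform": 4, "sql_capabilities": 2, "services": 5,
                   "open_or_proprietary": 5, "maturity": 3},
    "Platform B": {"hadoop_platform": 3, "sql_capabilities": 4, "services": 3,
                   "open_or_proprietary": 2, "maturity": 4},
}

for name, score in rank_platforms(candidates):
    print(f"{name}: {score:.2f}")
```

A matrix like this does not replace the detailed questions in the list. It mainly makes the trade-offs explicit, for example how much weight SQL capabilities should carry relative to vendor services.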


Recommendations

BEFORE PURCHASING A HADOOP PLATFORM, there are several things you should consider:

1. Identify the types of data, applications and workloads you wish to deploy on Hadoop. The starting point for any project is not technology selection, but identifying business requirements. In the case of IT, the business case for Hadoop is often to reduce the costs of storing, managing and transforming data. For the business, most requirements are specific to the analytic processing needs of a particular business area. Regardless of whether the requirements come from IT or the business, the first task is to identify the sources and types of data required to satisfy requirements and to ascertain the processing to be performed on that data. This latter information enables project managers to evaluate which Hadoop platforms are capable of providing the required functions and performance and to determine the hardware and software needed to support the project.

2. When determining the cost of the Hadoop data management platform, calculate the total cost of ownership for the system. Many customers, when comparing or purchasing a Hadoop system, simply calculate the cost per terabyte of the system or the purchase price for the hardware and software. There are many additional costs involved, including those for training, installation, application development, administration and maintenance. It is important that all of these costs are taken into account.

3. Understand the education and skills requirements for implementing Hadoop. Many organizations don't have all of the skills or experience required to build and deploy a Hadoop environment. It is important to create an inventory of existing skills and to identify gaps in skill sets that need to be filled before the project can proceed. In general, services from Hadoop vendors and/or third-party consulting companies can be used to fill those gaps.

4. Investigate if other parts of the organization are using Hadoop. Hands-on Hadoop experience takes time to acquire, and project managers should investigate if there are other parts of the organization that are using Hadoop that can act as a source of knowledge and best practices.

5. Talk to customers who have deployed Hadoop solutions. Organizations should also look for other companies in the same market sector who are using Hadoop to see how they are deploying Hadoop and to share information about best practices.

6. Be careful in your choice of Hadoop data management platform and understand its impact on the existing IT infrastructure. There are many different components that are potentially involved in a Hadoop environment. The actual components used will depend on project requirements, performance needs and on existing IT deployment strategies and standards. Careful platform selection is required, not only to ensure that the platform selected satisfies requirements, but also that it can easily be integrated into the existing IT, business intelligence and data warehousing infrastructure.

7. Be realistic, but also pragmatic, about the value and use of Hadoop. One of the main difficulties in selecting a Hadoop solution at present is separating reality from hype. This is not only due to a rapidly changing marketplace and over-marketing by certain vendors, but also the wide diversity of opinions about Hadoop use cases, maturity and its role in enterprise systems. While there is no doubt that Hadoop is a valuable addition to the technology toolbox, it is not a panacea. It must be recognized that Hadoop is still immature, even though it is changing and evolving rapidly. This rapid evolution is beginning to lead to industry fragmentation, which can lead to vendor lock-in. It is important that organizations carefully select a platform that supports their needs, avoids vendor lock-in where possible and also provides a flexible architecture that can evolve with changes in the Hadoop marketplace.
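The second recommendation contrasts purchase cost per terabyte with total cost of ownership. The sketch below works through that arithmetic under stated assumptions: the cost categories follow the ones named in the recommendation, but the split into one-time and recurring costs, the three-year horizon and every dollar figure are hypothetical values chosen only to show the calculation.

```python
from dataclasses import dataclass


@dataclass
class PlatformCosts:
    """Cost inputs for one candidate platform (all figures hypothetical)."""
    usable_terabytes: float
    purchase_price: float           # one-time hardware and software purchase
    training: float                 # one-time
    installation: float             # one-time
    application_development: float  # one-time
    annual_administration: float    # recurring, per year
    annual_maintenance: float       # recurring, per year


def purchase_cost_per_tb(costs):
    """Purchase price only, divided by usable capacity."""
    return costs.purchase_price / costs.usable_terabytes


def total_cost_of_ownership(costs, years):
    """One-time costs plus recurring costs over the given number of years."""
    one_time = (costs.purchase_price + costs.training
                + costs.installation + costs.application_development)
    recurring = years * (costs.annual_administration + costs.annual_maintenance)
    return one_time + recurring


def tco_per_tb(costs, years):
    """Total cost of ownership divided by usable capacity."""
    return total_cost_of_ownership(costs, years) / costs.usable_terabytes


# Illustrative numbers only, not vendor pricing.
example = PlatformCosts(
    usable_terabytes=100,
    purchase_price=200_000,
    training=20_000,
    installation=30_000,
    application_development=150_000,
    annual_administration=80_000,
    annual_maintenance=40_000,
)

print(f"Purchase cost per TB: {purchase_cost_per_tb(example):,.0f}")
print(f"3-year TCO:           {total_cost_of_ownership(example, 3):,.0f}")
print(f"3-year TCO per TB:    {tco_per_tb(example, 3):,.0f}")
```

With these made-up inputs, the three-year cost per terabyte is several times the purchase cost per terabyte, which is the gap the recommendation warns about.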


Vendor Assessments

The following assessments offer a brief synopsis of vendors' Hadoop data management platforms, with an emphasis on the best known and most widely deployed products. It is important to note that these platforms are evolving and changing rapidly, and to keep this report as current as possible, the assessments may include features that are in an advanced stage of development, but not generally available.

All Hadoop platform products (unless noted otherwise) integrate several Apache Software Foundation (ASF) projects including Hadoop (HDFS and the MapReduce framework), Hive, Pig and ZooKeeper. Some platforms also include other ASF projects such as Ambari, Flume, HBase, HCatalog, Mahout, Oozie and Sqoop.

The assessments are broken down into three categories of vendors: Hadoop vendor startups, enterprise vendors, and database and analytic server vendors.

Hadoop Vendor Startups

THIS SECTION INCLUDES ASSESSMENTS of the three major vendors (Cloudera, Hortonworks and MapR) that focus exclusively on providing Hadoop platforms for businesses, vendors and systems integrators.

Cloudera (www.cloudera.com)

Cloudera is a VC-funded software company that provides Hadoop-based software, training, consulting and support services. Mike Olson, a former Oracle and Sleepycat Software executive, co-founded the company in 2008 with three engineers from Google, Yahoo and Facebook (Christophe Bisciglia, Amr Awadallah and Jeff Hammerbacher). Doug Cutting, one of the originators of Hadoop, and Chairman of the Apache Software Foundation, is Cloudera's chief architect.

Hadoop Solutions: CDH, Cloudera Standard, and Cloudera Enterprise.

CDH (Cloudera's Distribution, including Apache Hadoop) is a free open source Hadoop distribution that can be downloaded from the Cloudera website. It contains the core components of Hadoop plus Cloudera Impala (an interactive SQL layer for Hadoop) and Hue (a set of web applications for interacting with Hive, Impala, MapReduce, Pig, HBase and other CDH components). CDH works with leading Linux distributions, and includes an install wizard for deployment on Amazon Web Services. It is also certified for use as a virtual machine on VMware vSphere. Cloudera recently announced Cloudera Search, which is an open source project that will be included with CDH once beta testing is complete.


Cloudera Standard is a Hadoop software platform that includes CDH and the Cloudera Manager. Like CDH, it can be downloaded for free from the Cloudera website. Cloudera Manager extends CDH with a graphical interface for installing and configuring a Hadoop cluster. With Cloudera Standard, some features of the Cloudera Manager are not included in the license (rolling restarts and SNMP support, for example). Note that Cloudera Standard (formerly known as the Cloudera Enterprise Free edition) used to be restricted to a maximum cluster size of 50 nodes, but this restriction has now been removed.

Cloudera Enterprise is Cloudera's flagship Hadoop software subscription. It provides CDH plus the full capabilities of the Cloudera Manager, including authentication and security features (LDAP and Kerberos), and enterprise-level system monitoring and management capabilities. It also comes with technical support and legal indemnification for subscribers. Cloudera Enterprise offers several different subscription options that may be added to the core license. Examples include Cloudera Enterprise RTD, Cloudera Enterprise RTQ and Cloudera Enterprise BDR. Enterprise RTD adds HBase monitoring, while Enterprise RTQ is required for monitoring Impala operations. Enterprise BDR includes backup and disaster recovery for HDFS and Hive data. When Cloudera Search becomes generally available, a Cloudera RTS option will be added.

Another option that can be added to a Cloudera Enterprise subscription is Cloudera Navigator. Cloudera's stated direction for Cloudera Navigator is for it to provide four main capabilities: data access controls and auditing, metadata reporting, data lineage reporting and data lifecycle management. Release 1.0 of Navigator delivers the first of these capabilities: data access controls and auditing.

Licensing: Cloudera Enterprise is licensed as part of a Cloudera Enterprise Subscription. Each subscription includes both the software and support services. There are subscriptions for Cloudera Enterprise (includes Enterprise BDR and Navigator), Enterprise RTD and Enterprise RTQ.

Data Access and Manipulation: Data on a Cloudera Hadoop cluster is accessed using SQL (Hive and Impala), Pig scripts, MapReduce applications or search (using Cloudera Search, which is based on the Lucene search engine).


Cloudera is also working on support for D3 (data driven documents) visualization and Spark in-memory computing for iterative algorithms, such as those used in data mining and machine learning.

SQL Support: In addition to Hive, Cloudera provides interactive SQL capabilities through its Impala project. Impala is an SQL layer developed by Cloudera that adds an interactive query facility for accessing HDFS and HBase data. It is intended to complement the batch-oriented nature of Hive and MapReduce processing. Impala does not use Hive (or MapReduce), but its SQL syntax is broadly compatible with Hive, and it leverages existing Hive metadata. Although Impala is Apache licensed, it is not an Apache project. Cloudera coordinates the open source community for Impala. Impala improves query response times compared with Hive, but the first release of the product is limited in SQL functionality. It is therefore more suited to basic query operations, rather than complex SQL analyses. It is reasonable to assume that Impala functionality will be enhanced in subsequent releases. Cloudera is also developing (in conjunction with Twitter and other partners) the Parquet columnar data store for use with Impala, which will further improve the performance of SQL queries.

Hortonworks (www.hortonworks.com)

Hortonworks is a software company that provides Hadoop-based software, training, consulting and support services. Benchmark Capital, together with Yahoo, founded the company in 2011.

Hadoop Solutions: Hortonworks Data Platform (HDP) and Hortonworks Sandbox.

HDP is a free open source Hadoop platform that can be downloaded from the Hortonworks website. It contains all the key components of a Hadoop platform and works with leading Linux distributions. Unlike other Hadoop platforms, it also supports Microsoft Windows Server. Cloud deployment is available for Microsoft Azure, OpenStack and Rackspace. High-availability capabilities are available for the Red Hat Linux and VMware environments. HDP includes Apache Ambari (an open source management and monitoring tool) and Talend Open Studio for Big Data (a leading open source data integration and transformation tool). It also provides an NFS interface so that Hadoop can be mounted as a network file system.


Hortonworks has close technology relationships with Microsoft and Teradata, who both use HDP in their Hadoop platforms.

Hortonworks Sandbox is a free self-contained and preconfigured virtual machine with built-in interactive tutorials for learning about Hadoop. It runs on both Windows and Mac computers with a minimum of 4GB of main memory. It requires VirtualBox, VMware Player, VMware Fusion or Hyper-V virtualization software.

Licensing: Hortonworks does not charge for HDP or the Hortonworks Sandbox, but instead generates revenue through its software support subscriptions, public and private training, and consulting services. Several different support options are available, depending on the needs of the customer.

Data Access and Manipulation: Data on a Hortonworks Hadoop cluster is typically accessed and processed using SQL via Hive, Pig scripts, MapReduce applications or HBase applications. Hortonworks partners with Lucid Imagination to make its LucidWorks search interface available for HDP. LucidWorks is a commercial version of Apache Lucene/Solr.

SQL Support: HDP supports SQL access using Hive. Hortonworks, in conjunction with Facebook, Microsoft, SAP and other partners, has launched the Stinger initiative within the Hive open source community. The goal of Stinger is to make Apache Hive perform 100 times faster so that it can support interactive SQL queries against data stored within Hadoop. Stinger is being delivered in three phases. Phase 1 extends SQL functionality and SQL join performance, and adds a new columnar file format known as an ORCFile. Phase 2 exploits Apache YARN to enable a wider range of workloads, and adds a facility known as Tez (which is an Apache incubator project) to improve overall runtime job performance. Phase 3 includes a new query engine, a cost-based optimizer and improved buffer management. The goal of the Stinger initiative is to develop and deliver all of these enhancements through open source community projects such as Apache Hive and Tez. Phase 1 of Stinger is complete and available in Hive and HDP.

MapR (www.mapr.com)

MapR is a VC-funded software company that provides Hadoop-based software, training, consulting and support services. It was founded in 2009.


Hadoop Solutions: MapR M3 Free, M5 Edition and M7 Edition.

M3 Free is a no-cost Hadoop software platform that can be downloaded from the MapR website. It contains the core components of Hadoop and works with leading Linux distributions. Cloud deployment is available for Amazon Web Services and the Google Cloud Platform. A virtual machine version is also available for use with the VMware Player. M3 includes the MapR Data Platform, which replaces HDFS with a new file system that supports HDFS and NFS application interfaces. M3 supports both multi-writer and random read/write operations, which aids performance. M3 also includes the MapR Control System (MCS), which is a browser-based management console for viewing and controlling a Hadoop cluster.

The M5 Edition is a Hadoop software platform that extends MapR M3 with multi-node NFS for high availability (M3 supports only single-node NFS), remote data mirroring, snapshots for fast data recovery, and data placement control. It also provides high availability by eliminating the Hadoop JobTracker and NameNode single points of failure.

The M7 Edition further extends the capabilities of the M5 edition with a unified file system (MapR-FS) that supports both HDFS file data and HBase table data. This file system improves the performance and resiliency of HDFS and HBase, and is API compatible with both of these latter systems.

Licensing: M3 Free is a no-cost package that is supported through MapR's community forums. M5 and M7 are licensed offerings that provide full support from MapR including on-demand patches and online incident submission.

Data Access and Manipulation: Data on a MapR Hadoop cluster is accessed using Hive HiveQL statements, Pig scripts or MapReduce applications. MapR partners with Lucid Imagination to make its LucidWorks search interface available for the MapR platform. LucidWorks is a commercial version of Apache Lucene/Solr.

SQL Support: MapR is heavily involved in the Apache Drill incubator project, which is inspired by Google's Dremel and BigQuery projects. MapR will almost certainly use Drill for supporting the interactive query and analysis of HDFS, HBase and MapR-FS data. Drill's primary query language, DrQL, is an SQL-like language that is compatible with Google BigQuery.


Cloudera, Hortonworks and MapR Positioning: Between them, these three vendors dominate the use of independent Hadoop software platforms in organizations. Outlined below are some key product differentiators.

All three vendors are extending their products in areas such as installation, security and systems management to improve ease of use and add capabilities expected by enterprise users.

Cloudera was first to market and enjoys a solid position in the Hadoop marketplace. It also has an extensive partnership program.

Hortonworks is the only vendor of the three that does not charge a license fee for any of its software. It is also the only vendor to support Hadoop on Windows Server: Microsoft uses the Hortonworks HDP platform for Microsoft HDInsight running on Windows Server and the Windows Azure service.

Of the three vendors, MapR replaces the most Apache Hadoop components. Although these changes attempt to maintain API compatibility with corresponding Apache components, several of them are nevertheless proprietary. These extensions are a complete redesign of the underlying Hadoop architecture and enable MapR to offer high performance and stability to enterprise customers. Given the rapid development of Apache Hadoop-related projects, however, it remains to be seen if the MapR extensions will in the long term continue to offer an advantage. Of course, the proprietary nature of the MapR extensions is not unique; many enterprise vendors with Hadoop solutions add their own proprietary features. The MapR extensions are available to the open source community through GitHub (github.com).

Both Cloudera and Hortonworks claim their software is 100% open source. However, it is important to understand what this means. There are two main benefits of open source. The first is that it encourages community participation in both support and product enhancements. In general, this first benefit is achieved regardless of the method, or organization, used to make the software available to the open source community. The second benefit of open source is the ability to include new enhancements in multiple products. In the case of Hadoop, it can be argued that this may work best if the enhancements are part of an Apache Software Foundation (ASF) project. This is because the Hadoop market is starting to fragment in the same way that Unix and Linux did. A vendor may make a new feature available in an open source format, but this does not mean other vendors will pick up and integrate the feature unless it is a part of an ASF project. This may be due to integration difficulties or because the vendor may already provide an alternative solution. This is why the role of the Apache Software Foundation is so important: it encourages vendors to incorporate “standard” ASF components.

An example here is SQL support. All three vendors are working on their own SQL initiatives: Cloudera Impala, Hortonworks Stinger and MapR Drill. At present, the Hortonworks and MapR efforts are being developed through the ASF process, whereas Cloudera Impala is not an ASF project. The bottom line is that caution needs to be exercised when employing capabilities or components that are not a part of the ASF process. It should be noted that in discussions with vendors and other industry players, not everyone agrees with the above comments concerning the importance of the ASF process.

These three vendors offer independent Hadoop software platforms that can be deployed in a variety of hardware and software environments. They reduce installation and management costs compared with downloading and integrating Hadoop components directly from the ASF. They also offer a low-cost alternative to the enterprise solutions that are discussed next. This lower cost may come at the expense of availability and performance, and may bring increased installation, administration and development costs. However, these platforms form the core of certain enterprise vendor products, and so it would be possible, in some cases, to start with one of these platforms and then subsequently move to an enterprise vendor solution as necessary. Finally, it must be remembered that although all three vendors enjoy considerable visibility, they are small and still in start-up mode.

Enterprise Vendors

THIS SECTION REVIEWS THE HADOOP SOLUTIONS offered by established enterprise hardware and software vendors.

HP (www.hp.com/go/hadoop)

HP's Hadoop products are a part of its development and marketing thrust to offer big data solutions. It breaks these solutions into three components: information insight (HP Autonomy, HP Vertica and HP AppSystem for SAP HANA), information management (HP AppSystem for Apache Hadoop) and information infrastructure (HP StoreAll, HP BladeSystem and HP Moonshot). Visit www.hp.com/go/bigdata for more details about HP's big data initiative. This section reviews Hadoop support in the information management component.

Hadoop Solutions: HP AppSystem for Apache Hadoop and HP Reference Architectures for Cloudera, Hortonworks and MapR.


The HP AppSystem for Apache Hadoop is an integrated hardware and software Hadoop appliance that includes the following components:

HP ProLiant Gen8 Servers and HP Networking Switches. The appliance is available in both half-rack and full-rack configurations. The single rack version comes with a management node, name node, job tracker node, 8 worker nodes and 2 Ethernet switches. Customers can scale out these configurations by adding additional racks. HP has customers running Hadoop clusters with several hundred nodes.

HP Insight Cluster Management Utility (Insight CMU). CMU is a tool for deploying and managing Linux-based nodes in large clusters. It enables easy scale-out deployment, remote cluster management, and real-time and historical monitoring.

Cloudera Enterprise Hadoop platform. See the Cloudera Enterprise assessment in this section for further details.

Red Hat Enterprise Linux. Visit www.redhat.com for more details.

HP Vertica Community Edition. Vertica is a relational DBMS with a columnar data store that is optimized for analytic processing. The Community Edition comes with built-in WebHDFS connectors that enable Hadoop HDFS data to be loaded into a Vertica database.

The HP Reference Architectures for Cloudera, Hortonworks and MapR are recommended hardware and software configurations developed jointly by HP and the Hadoop software provider. The objective is to provide Hadoop solutions that balance performance, storage and cost. Each reference architecture documents a progression of configurations from single-rack to multi-rack Hadoop clusters. A customer can select and purchase a specific configuration, and then have HP build and test the system prior to delivery. In the case of Cloudera, the reference architecture has evolved into an integrated appliance of software, hardware and support services: the HP AppSystem for Apache Hadoop. It is likely that the Hortonworks and MapR reference architectures could evolve in the same manner.

HP Positioning: In a similar way to other enterprise vendors, HP views Hadoop as one component of a big data ecosystem. HP positions Hadoop as an information management system that is used primarily as a data
refinery to capture, manage, filter, transform and aggregate large volumes of data, especially multi-structured data. The results of this processing can then be loaded into other HP products, such as Autonomy and Vertica, for more advanced analytic processing. From a Hadoop perspective, HP's main objective is to extend Hadoop platforms from companies such as Cloudera, Hortonworks and MapR with additional capabilities and services that make Hadoop more suitable for enterprise use. Another objective is to provide Hadoop connectors that allow data to be moved into existing enterprise systems. HP, for example, provides Hadoop connectors for HP Autonomy and HP Vertica.

IBM (www.ibm.com/software/data/bigdata)

IBM is investing significant development and marketing resources in its big data platform, of which Hadoop is one component. An excellent in-depth guide to the big data platform can be found in the book, Harness the Power of Big Data: The IBM Big Data Platform. An electronic version of the book can be downloaded from the big data section (see link above) of the IBM website.

Hadoop Solutions: InfoSphere BigInsights Basic Edition, InfoSphere BigInsights Quick Start Edition, InfoSphere BigInsights Enterprise Edition and the PureData System for Hadoop.

The BigInsights Basic Edition is a free download that provides an integrated software platform of Apache Hadoop components (informally known as the IBM Distribution for Hadoop). It comes with the BigInsights installer to simplify the task of installing and configuring a Hadoop cluster. If required, this edition can be used with the Cloudera CDH Hadoop distribution.

The BigInsights Enterprise Edition is a licensed Hadoop software platform that adds a wide range of capabilities to the Basic Edition, including:

• Eclipse IDE plug-ins for developing, testing and running applications using Hive, Pig, MapReduce, HBase, Oozie workflows, IBM Jaql and IBM text analytics.

• Toolkits and analytic functions for text analytics, data mining and machine learning.

• Packaged application accelerators for creating analytics using text, social computing, machine and telecommunications event data.

• BigSheets, a spreadsheet data visualization tool for data analysis and exploration.

• BigIndex search engine based on Apache Lucene.

• BigInsights web console to monitor and manage the Hadoop cluster, browse files, run analytic and ad hoc R applications, and perform BigSheets analyses.

• GPFS-FPO (General Parallel File System – File Placement Optimizer), an optional file system that replaces HDFS and adds improved performance, POSIX security, enhanced systems management, and high availability and disaster recovery.

• Adaptive MapReduce to improve the performance of small MapReduce jobs.

• BigInsights scheduler for adaptable workflow allocation.

• Automatic and transparent failover that removes Hadoop's NameNode and JobTracker single points of failure.

• InfoSphere Streams limited edition for creating real-time analytics.

• InfoSphere Data Explorer limited edition for exploring and visualizing large amounts of structured and multi-structured data.

• IBM Cognos Limited Edition that allows business users and analysts to access and process Hadoop data using a Hive interface.

• Integration with other IBM products, including InfoSphere Guardium data security, InfoSphere Streams, DB2, PureData System appliances and the DataStage component of the InfoSphere Information Server.

A free, non-production Quick Start Edition of BigInsights is also available for download from the IBM website. This version provides the
capabilities of the Basic Edition plus several features from the Enterprise Edition, including Eclipse IDE plug-ins for development, text analytics toolkit, BigSheets, BigSQL (see description below), adaptive MapReduce, and the BigInsights web console and scheduler.

InfoSphere BigInsights can be deployed on systems running Red Hat Enterprise Linux, SUSE Linux Enterprise or IBM PowerLinux. It is also designed to work in a cloud environment, and has been deployed successfully on Amazon Web Services, Rackspace and the IBM SmartCloud.

At the time of writing, IBM had just announced the PureData System for Hadoop for availability in the latter part of 2013. This system is an integrated hardware and software appliance that includes BigInsights running on IBM System x with Red Hat Linux. Key features of this appliance are high availability and security, a single console for managing the hardware and software environment, and built-in data archiving tools.

SQL Support: Prior to the recently announced 2.1 release of BigInsights, IBM supported HiveQL and its Jaql SQL-like language for accessing and processing data. With 2.1, IBM is adding BigSQL, which is compatible with the SQL-92 standard. It adds several extensions to SQL-92, including a variety of built-in functions. The BigSQL server and SQL engine run on a single node of a Hadoop cluster and receive SQL requests via JDBC and ODBC drivers. These SQL requests may be processed on the BigSQL server using local storage handlers (HBase, sequential files, for example), and/or routed to other Hadoop cluster nodes for processing by Hive. If necessary, the SQL engine will rewrite queries to improve performance. The BigSQL server coordinates and merges results from local files and the Hadoop cluster. A key feature of BigSQL is to provide efficient HBase processing of both local data and data distributed across the cluster. BigSQL uses its own HBase data handler, data encoding and indexes to achieve this. The big benefit of BigSQL is that it enables HBase to be used for managing a large number of smaller tables and for ad hoc queries, while continuing to support Hive and MapReduce jobs for batch-style processing against large data files.

IBM Positioning: IBM views Hadoop as a valuable system for extending the capabilities of existing enterprise systems. The four InfoSphere BigInsights configurations it offers (Basic Edition, Quick Start Edition, Enterprise Edition and PureData System for Hadoop) enable customers to start small and scale out their Hadoop environments as requirements grow. InfoSphere BigInsights enhances the open source Apache Hadoop environment with a range of tools and functions that make Hadoop suitable for use in the enterprise. These enhanced capabilities ease installation and administration, improve performance and security, increase application development productivity, and add toolkits and accelerators for creating analytic applications. It also includes a wide range of capabilities that enable integration with other IBM business intelligence, data management, and systems management products. BigInsights is used as a data refinery for collecting and transforming large volumes of multi-structured data for use by downstream IBM systems, and also as an investigative computing platform for exploring and analyzing multi-structured data.

Intel (www.hadoop.intel.com)

Intel sees a huge hardware opportunity in big data and Hadoop; and unlike other enterprise vendors, its motivation in offering Hadoop software solutions is not to leverage other enterprise software products, but to enhance open source software so that it exploits the performance capabilities of Intel's hardware offerings.

Hadoop Solution: Intel Distribution for Apache Hadoop Software.

The Intel Distribution for Apache Hadoop Software is a licensed package of integrated Hadoop software components (and support services) that is optimized for Intel Xeon processors, Intel SSD data storage and Intel 10GbE networking. It includes a set of Apache Hadoop components that have been enhanced to improve installation and administration, security and performance. Key enhancements in the distribution include:

• Optimized support for Intel processor, storage and networking hardware. Intel claims these optimizations can boost overall Hadoop performance up to 30 times. Various improvements are also included to boost both Hive and HBase performance. Intel has developed the HiBench suite of 10 workloads to benchmark the performance of Hadoop. This benchmark is available under an Apache open source license.

• The Intel Manager for Apache Hadoop Software, which is a management console that simplifies the installation, configuration, tuning, monitoring and security of a Hadoop deployment. It uses Nagios (see www.nagios.org) and Ganglia (see www.ganglia.sourceforge.net) to monitor resources and configure alerts.

• Support for Intel AES-NI hardware encryption technology to boost performance when encrypting and decrypting HDFS and HBase data.

• Role-based access controls for HDFS and HBase, and cell-based access controls for HBase. LDAP and Active Directory authentication and Kerberos security are also supported. Intel has launched project Rhino (see www.github.com/intel-hadoop), which is an open source effort to improve the data protection capabilities of Hadoop.

• Adaptive data replication for HDFS and HBase that adjusts the number of replicas depending on workload characteristics.

• Optimized Revolution Technologies R connector for statistical analysis.

• Intel Graph Builder for graphical analysis. This component is available to the open source community through an Apache license.

• Partnership with MarkLogic to offer MarkLogic's big data search technology on top of HDFS.

• Connectivity to SAP Hana using SAP's Smart Data Access data virtualization technology.

The Intel distribution supports CentOS, Oracle Linux and Red Hat Enterprise Linux.

SQL Support: Intel is working with partners on project Panthera (see www.github.com/intel-hadoop/project-panthera), which is designed to improve Hadoop SQL functionality and query performance. This project currently has two components: an SQL engine built on top of Hive that extends Hive's SQL capabilities and a document-oriented store for HBase that reduces table storage requirements and boosts SQL query performance. Intel's objective is to make all of these enhancements
available to the open source community through the Apache Hive and HBase projects. It is also worth noting that Intel is working with Salesforce.com to merge the Salesforce Phoenix SQL project for HBase into Panthera.

Intel Positioning: Intel's revenue comes from hardware and services, not software. It is, however, motivated to ensure that leading software solutions work efficiently and exploit the performance capabilities of its existing and future hardware. Given the growing use of Hadoop, and the amount of data being managed on Hadoop systems, the potential hardware revenue for Intel is significant. The Intel Apache Hadoop distribution is designed to improve the capabilities and performance of Hadoop applications running on Intel hardware. In general, the improvements Intel is making are being made available to the open source community and, in many cases, being incorporated into Apache projects. In many respects, the Intel Distribution should be compared with distributions from Cloudera, Hortonworks and MapR, rather than solutions from other enterprise providers. The Intel product is of value to systems integrators and other partners, but it remains to be seen how well it will be adopted by mainstream enterprises, compared with open source alternatives. Of course, one benefit Intel brings is that it is unlikely to go out of business or be acquired by another enterprise vendor!

Microsoft (www.microsoft.com/bigdata)

Microsoft is motivated to ensure that Hadoop systems can coexist and interoperate with existing Microsoft products. Given that most Hadoop products only work on Linux, it is also motivated to increase the adoption of Hadoop on its own system software, specifically Microsoft Windows Server, and in the cloud on Windows Azure. Its technology relationship with Hortonworks is intended to achieve both of these objectives.

Hadoop Solution: HDInsight Server for Windows.

HDInsight for Windows is a licensed Hadoop software platform built on top of the Hortonworks Data Platform (HDP), and the assessment of the Hortonworks HDP earlier in this document should be referenced for an overview of the Hadoop features in HDInsight. This section will focus on the capabilities in HDInsight that allow it to coexist with other Microsoft solutions. A summary of these capabilities follows:

• Hive add-in for Excel that allows Excel and Excel Data Explorer users to access data in Hadoop Hive tables.

• Hive ODBC driver that enables SQL Server tools such as PowerPivot and Power View to access data in Hadoop Hive tables.

• Support for MapReduce jobs in Microsoft .NET, JavaScript applications and browsers supporting HTML5.

• Bi-directional Hadoop connectors for exchanging data with Microsoft SQL Server Parallel Data Warehouse (PDW). This is implemented by a facility known as Polybase, which allows HDFS data to be viewed as an external table by SQL Server. Polybase supports parallel data transfers, and allows retrieved HDFS data to be combined with SQL Server data. In the first release of Polybase, the complete file is copied into SQL Server, but Microsoft is working on push-down optimization that would enable MapReduce jobs running on Hadoop to do more of the work and thus reduce the amount of data being transferred to SQL Server.

• Integration with Microsoft System Center and Windows Server Active Directory.
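
The difference between copying a complete file and pushing work down to Hadoop, as described for Polybase in the list above, can be illustrated with a toy sketch. The Python below is not Polybase code and calls no Microsoft or Hadoop API; the in-memory row list and the function names full_copy and push_down are hypothetical stand-ins chosen for illustration. It only contrasts shipping every row and filtering afterwards with filtering on the Hadoop side before transfer.

```python
# Hypothetical in-memory stand-in for rows stored in an HDFS file.
HDFS_ROWS = [
    {"id": 1, "region": "east", "amount": 120},
    {"id": 2, "region": "west", "amount": 80},
    {"id": 3, "region": "east", "amount": 45},
    {"id": 4, "region": "north", "amount": 300},
]


def full_copy(rows, predicate):
    """Ship every row to the database side, then filter there."""
    transferred = list(rows)                      # the complete file is moved
    result = [r for r in transferred if predicate(r)]
    return result, len(transferred)


def push_down(rows, predicate):
    """Filter on the Hadoop side first, then ship only matching rows."""
    transferred = [r for r in rows if predicate(r)]
    return transferred, len(transferred)


def is_east(row):
    """Example filter: keep only rows for the 'east' region."""
    return row["region"] == "east"


copied_result, rows_copied = full_copy(HDFS_ROWS, is_east)
pushed_result, rows_pushed = push_down(HDFS_ROWS, is_east)

print("rows transferred without push-down:", rows_copied)
print("rows transferred with push-down:", rows_pushed)
```

Both functions return the same query result; only the number of rows crossing the boundary differs, which is the saving the push-down approach aims for.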

In addition to supporting Microsoft Windows Server, HDInsight is also available as a cloud service on Windows Azure and is supported for use in a virtual machine environment using Microsoft Hyper-V and the Microsoft System Center Virtual Machine Manager.

Microsoft Positioning: Many enterprises use Microsoft Office, Microsoft Windows Server and Microsoft SQL Server, and it will become increasingly important, as the use of Hadoop grows, for these companies to be able to access Hadoop data using familiar tools. Connectivity to Hadoop data from existing Microsoft products, however, is immature and organizations should be cautious in their use of these capabilities. Given that many Microsoft customers have little or no experience with Linux, the ability to deploy Hadoop in a Windows environment is attractive. At present, HDInsight is the only product that supports Hadoop on Windows, and this puts both Microsoft and Hortonworks in a solid competitive position.

Oracle (www.oracle.com/us/products/database/big-data-appliance/)

Oracle develops and markets a portfolio of integrated hardware and software appliances for managing and processing big data, including the Oracle Exadata Database Machine, Oracle Exalytics In-Memory Machine and the Oracle Big Data Appliance. It also provides software for filtering and transforming data and for moving data between these three appliances; examples include Oracle Data Integrator and Oracle Big Data Connectors. This assessment focuses on Oracle's solutions for the Apache Hadoop environment and looks specifically at the Oracle Big Data Appliance.

Hadoop Solution: Big Data Appliance.

The Oracle Big Data Appliance X3-2 is an integrated hardware and software system that is built on top of Cloudera's CDH Hadoop distribution. The assessment of Cloudera CDH earlier in this document should be referenced for an overview of the Hadoop features of CDH. This section will focus on the additional Oracle capabilities used to build out the appliance. These include:

• A Full Rack hardware configuration consisting of 18 Sun dual 8-core Intel Xeon processor nodes with a 40 Gb/second InfiniBand interconnect. A Starter Rack configuration is also available containing 6 Sun servers in a full rack. An In-Rack Expansion option of 6 Sun servers enables the starter configuration to be expanded to 12 nodes, and then to a full rack of 18 nodes. The configuration allows up to 18 racks to be interconnected, but larger systems can be supported by adding more InfiniBand switches.

• Oracle Linux and Oracle Hotspot JVM.

• Oracle Enterprise Manager plug-in for managing and monitoring Hadoop hardware and software operations.

• Oracle NoSQL Database Community Edition, a nonrelational key-value pair database system that can be used in conjunction with, or in place of, the data management capabilities of Cloudera CDH (Pig, Hive, MapReduce, HDFS, HBase). The Oracle NoSQL Database is built on the Oracle Berkeley DB Java Edition that has been enhanced to run on a multi-node cluster. It is intended for low-latency ad hoc query and update processing. An Enterprise Edition of the product is available as a separately licensable
option. This latter edition provides integration with other Oracle products and provides Oracle support services.

• Oracle R distribution, an Oracle supported distribution of open source R.

Oracle offers optional licensable Oracle Big Data Connectors for use with the Big Data Appliance (and other generic Hadoop systems):

• Oracle Loader for Hadoop uses MapReduce processing to convert Hadoop and Oracle NoSQL database data into an Oracle format for loading into an Oracle database on another system. Two modes of operation are supported: online mode immediately loads the converted data into an Oracle database, whereas offline mode leaves the converted data on the Hadoop system for later access and use.

• Oracle SQL Connector for Hadoop Distributed File System allows Oracle Database applications on another system to use Oracle SQL statements to access Hadoop Hive tables and HDFS data files. These files and tables are accessed by Oracle applications as SQL external tables. This connector can also be used to load Hadoop data files created by the Oracle Loader into an Oracle database.

• Oracle Data Integrator Application Adapter for Hadoop enables a Data Integrator application running on Hadoop or a remote Oracle system to use Hive to load external data into a Hadoop file and transform it. If required, other Oracle Hadoop connectors can then load the transformed data into a remote Oracle database.

• Oracle R Connector for Hadoop enables R scripts and analytic functions running on the Hadoop cluster to access Hive tables and HDFS data files, run MapReduce applications and interact with an external Oracle database system.

Oracle Positioning: Oracle views the Oracle Big Data Appliance as one component of a big data ecosystem. It quite clearly positions the appliance as a system that is used in conjunction with other Oracle enterprise components. It especially focuses on using the appliance as a data refinery for capturing and transforming large volumes of multi-structured data, and feeding subsets of that data to other Oracle systems. It also sees the appliance being used for analyzing multi-structured
data in a Hadoop data refinery, and also possibly for green-field projects in companies that do not have an existing data warehousing or BI system.

Pivotal (gopivotal.com/pivotal-products/pivotal-data-fabric/pivotal-hd)

Pivotal is a new venture that was spun out from EMC and VMware. GE has also invested $105 million in the company. Pivotal brings together a number of former EMC/VMware assets including Cetas, Cloud Foundry, GemFire, Greenplum, SpringSource and Pivotal Labs. The goal of Pivotal is to use these components to build Pivotal One, an enterprise platform-as-a-service (PaaS) solution for big data analytics. Although Pivotal has a close working relationship with EMC and VMware, the objective of the company is to be platform agnostic. This independence, if it can be maintained, will be important because customers are equally interested in deploying big data solutions on other cloud platforms such as the Amazon public cloud and the OpenStack private cloud in addition to VMware cloud offerings. This assessment examines Pivotal's Hadoop solutions, of which Greenplum relational database software is a key component.

Hadoop Solutions: Pivotal HD Enterprise, Pivotal HD Community, Pivotal Data Computing Appliance.

Pivotal HD Enterprise is a licensed Hadoop software platform containing Apache Hadoop open source components integrated with additional capabilities designed to support the use of Hadoop in enterprise systems. These additional capabilities include:

• Installation and configuration manager

• Management console for monitoring and managing Hadoop cluster operations

• Application task management using capabilities from the Spring Framework

• Parallel HDFS data loader

• Pivotal Advanced Database Services (ADS), a modified version of the Pivotal Greenplum Database for use on a Hadoop cluster. Pivotal ADS uses HDFS to store and manage data (see “SQL Support” for more details).

• Hadoop Virtualization Extensions (HVE) that extend Pivotal HD to support VMware virtualization technology by adding virtual node awareness and greater cluster elasticity.

• Unified Storage Services that support data access to traditional Hadoop direct-attached storage as well as EMC Isilon OneFS Scale-Out NAS Storage. EMC Isilon improves availability, removes the need for Hadoop data replication (improving storage utilization) and offers data mirroring, replication, snapshots and backup services.

Pivotal HD Enterprise can be deployed on a range of different hardware configurations, including the Pivotal Data Computing Appliance. It supports CentOS and Red Hat Enterprise Linux.

Pivotal HD Community is a free downloadable version of Pivotal HD Enterprise intended for evaluation and education purposes.

The Pivotal Data Computing Appliance (DCA) is an integrated hardware and software system that supports HD Enterprise running on generic Intel hardware and Linux software. It provides scalability from a rack to a 12-rack system. The nodes on a DCA cluster interoperate across a 10 Gb/second Ethernet interconnect. Note that HD Enterprise and the Pivotal Greenplum Database can coexist on the same appliance.

SQL Support: SQL support in Pivotal HD Enterprise is provided by Pivotal ADS (also known as HAWQ). The HAWQ SQL engine runs on a master node of the Hadoop cluster and receives SQL requests via JDBC and ODBC drivers. These SQL requests process table data stored on the Hadoop cluster in HDFS files. HAWQ supports the same SQL syntax as the Pivotal Greenplum database, which is broadly compatible with SQL-99 and several OLAP extensions from SQL-2003. The Pivotal eXtension Framework, as its name implies, extends HAWQ with a set of interfaces that enable native HDFS files, Hive tables and HBase tables to be defined to HAWQ as external tables, and queried using SQL. External tables have the advantage that the native data in them can be accessed through other Hadoop interfaces, but they do not perform as well as ADS-formatted tables. On the other hand, ADS-formatted tables are in a proprietary format and can only be accessed using HAWQ.

Pivotal Positioning: Pivotal is a new company that brings together a number of EMC and VMware assets, including Greenplum relational database technology. It will be a while before it becomes clear how well these assets can be incorporated into a single cohesive system. Pivotal's stated direction is toward Pivotal One, an enterprise PaaS solution. For Greenplum this represents a dramatic change in strategy from competing against other enterprise database and analytic solutions to being used in enterprise cloud and virtualized PaaS environments. Like several other
enterprise vendors, Pivotal sees growth and revenue in supporting PaaS, but it remains to be seen if this vision can be realized.

Teradata (www.teradata.com)

Teradata for many years focused on high-performance solutions for large-scale data warehousing. More recently, it has evolved its product portfolio to offer a range of software products and hardware/software appliances for several different types of workload and application use cases. It also recently acquired Aster Data Systems, which offers hardware/software appliances and software-only analytic solutions for processing and analyzing large volumes of multi-structured data. In addition to Hadoop connectivity, Teradata also develops and markets the Big Analytics Appliance, which can run the Aster relational database and Hadoop software on a single hardware and software appliance.

Hadoop Solution: Teradata Portfolio for Hadoop.

The Teradata Portfolio for Hadoop consists of the following offerings:

• Teradata Aster Big Analytics Appliance

• Teradata Appliance for Hadoop

• Teradata Commodity Configuration for Hadoop

• Teradata Software-Only for Hadoop

The Teradata Aster Big Analytics Appliance is a Teradata parallel processing hardware platform designed for investigative computing and data discovery that can be configured with Aster Database nodes exclusively, Hadoop Hortonworks Data Platform (HDP) nodes exclusively, or a mixture of Aster and Hadoop nodes. Separate nodes can also be used for data backup and loading. The assessment of Hortonworks HDP earlier in this document should be referenced for an overview of the Hadoop features in the Aster Appliance. This section will focus on the capabilities in the appliance that allow it to coexist with other Teradata solutions. A summary of these capabilities is listed below:

• Teradata hardware platform with dual 8-core or 6-core Intel Xeon processor nodes, enterprise-class storage, a 40 Gb/second InfiniBand node interconnect, and a SuSE Linux operating system. RAID storage can also be used for high availability. Each Aster Database node can store up to 5.5 TB of data, and each Hadoop node up to 9.5 TB of data. A full rack can support up to 18 nodes. The system can in theory scale up to 1,000 racks and support 5 PB of Aster data or 10 PB of Hadoop data.

• Teradata server management software that proactively monitors software and hardware events for failures, and automatically reports exceptions and diagnostic data to Teradata Customer Services.

• A common management console (Teradata Viewpoint) for managing and monitoring both Aster Database and Hadoop nodes.

• Aster Database with pre-packed SQL-MapReduce analytic and data transformation functions that can retrieve and process data managed by both Aster Database and Hadoop nodes. A development kit is provided for writing additional SQL-MapReduce functions in languages such as C#, C, Java, Python and R.

• Aster Database SQL-H capability that extends SQL-MapReduce with the ability to access metadata managed by the Hadoop HCatalog facility to simplify access to Hadoop data.

• High-speed Aster-Teradata connector that enables the appliance to exchange data with a Teradata data warehousing system.

The Teradata Appliance for Hadoop is an integrated hardware and software platform that features the Hortonworks Data Platform running on Teradata hardware. The platform comes with several availability and systems management enhancements, including the ability to monitor and manage the complete system from the Teradata Viewpoint interface. Teradata positions the appliance as a Hadoop data refinery for storing, managing and transforming large volumes of multi-structured data.

The Teradata Commodity Configuration for Hadoop is a joint partnership between Teradata, Hortonworks and Dell to provide an optimized hardware and software environment for the Hortonworks Data Platform. Customers contract directly with Dell to purchase, install and support the hardware, while Teradata supports the Hortonworks software.

The Teradata Software-Only for Hadoop solution is the Hortonworks Data Platform software with support provided by Teradata.

Teradata Positioning: Teradata's strategy is to provide customers with three main system choices: Teradata for data warehousing, Aster for complex data analysis and discovery, and Hadoop (when needed) for
handling, staging and refining large volumes of multi-structured data. Teradata's direction is to provide a single application and systems management interface to these systems and to enable the systems to exchange and share data with each other. This is known as the Teradata Unified Data Architecture.

Database and Analytic Server Vendors

There are a growing number of database systems and analytic server products that are designed exclusively for use in a Hadoop environment. Many of them provide alternative solutions to existing relational systems and analytic products. However, as Hadoop begins to be deployed in more traditional and conservative enterprises, there will be a need for more familiar approaches to managing and analyzing Hadoop data. This section reviews two products that use a combination of existing and new technologies to support the management and analysis of Hadoop data.

Hadapt (www.hadapt.com)

Hadapt is a VC-funded software company that provides SQL-driven database technology for leading Apache Hadoop distributions. Dr. Daniel Abadi, a professor at Yale University, Kamil Bajda-Pawlikowski and Justin Borgman, also from Yale, founded the company in 2011. The technology is based on Dr. Abadi's research into combining relational database technology with Hadoop software.

Hadoop Solution: Hadapt Adaptive Analytic Platform.

The Hadapt Adaptive Analytic Platform is a new SQL-driven database system that was developed from the ground up to run on Hadoop and support multiple storage engines: HDFS for unstructured and multi-structured data, and a relational store for structured data. This approach allows the system to be used for both batch and interactive query processing. A Hadapt Data Loader is included that loads data in parallel. For performance and availability, data can be partitioned and replicated across the nodes of a Hadoop cluster. Hadapt is building out its SQL support, and its design goal is to be fully ANSI compliant. The product comes with a Hadapt Development Kit (HDK) that enables developers to create additional analytic functions accessible via SQL. These functions are written in various languages, the most popular being Java.

Hadapt does not include a Hadoop distribution. Instead, the company works with customers to deploy the product on Apache Hadoop or Hadoop distributions from companies such as Cloudera, Hortonworks and MapR.

Data Access and Manipulation: Data on a Hadapt Hadoop system is accessed using SQL, MapReduce applications or full text search. Hadapt supports SQL embedded in MapReduce code and MapReduce functions embedded in SQL statements. Support for Tableau Software's visual analytics product is also available.

SQL Support: All SQL in the Hadapt environment is processed by the Adaptive Query Execution engine. This engine employs a cost-based query optimizer and patent-pending split query execution technology to translate incoming queries into a combination of SQL requests and MapReduce operations to be run in parallel on the Hadoop cluster. The optimizer takes into account data partitioning and distribution, indexes and statistics about the data to create an initial query plan. In some cases, the optimizer may choose not to create MapReduce code and instead access data in HDFS or relational storage directly to provide fast performance. Hadapt may dynamically adjust the initial query plan at run time to optimize and balance node utilization and query performance.

Hadapt Positioning: Hadapt's technology is unique in that it provides a single integrated Hadoop database system for managing and processing both relational data and multi-structured data. This is in contrast to most enterprise database management vendors, whose approach is to build connectors between their existing relational database products and a Hadoop environment. These connectors can limit both functionality and performance, while also leading to multiple copies of the data. Hadapt is a VC-backed start-up company, and its technology is currently deployed in both enterprise and departmental configurations. It is well suited as a line-of-business system for doing investigative computing and data discovery and analysis against large volumes of structured and multi-structured data.
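To make the idea of split query execution described under SQL Support more concrete, the following is a minimal illustrative sketch in Python. It is not Hadapt's optimizer or API: the `TableInfo` class, the `plan_query` function, the store labels, the step names and the row-count threshold are all hypothetical, and a simple threshold rule stands in for a real cost model. It shows only the general shape of the decision: route each table to a SQL request, a direct HDFS read, or a MapReduce job based on where the data lives and on statistics about it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical catalog entry consulted by the toy planner."""
    name: str
    store: str      # "relational" or "hdfs" (labels assumed for illustration)
    row_count: int  # a statistic about the data


def plan_query(tables, mapreduce_threshold=1_000_000):
    """Return (operation, table_name) steps for a toy split plan.

    Assumption: a row-count threshold stands in for a real cost model.
    """
    steps = []
    for table in tables:
        if table.store == "relational":
            # Structured data in the relational store is served by a SQL request.
            steps.append(("sql_request", table.name))
        elif table.row_count < mapreduce_threshold:
            # Small HDFS data: skip MapReduce and read the files directly.
            steps.append(("direct_hdfs_read", table.name))
        else:
            # Large HDFS data: schedule a parallel MapReduce operation.
            steps.append(("mapreduce_job", table.name))
    return steps


tables = [
    TableInfo("customers", "relational", 50_000),
    TableInfo("small_lookup_log", "hdfs", 10_000),
    TableInfo("clickstream", "hdfs", 500_000_000),
]

plan = plan_query(tables)
for operation, name in plan:
    print(f"{operation:18s} {name}")
```

A production optimizer would also weigh partitioning, indexes and node utilization, and could revise the plan at run time; none of that is modeled here.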

SAS (www.sas.com/software/visual-analytics/technology.html)

SAS is a company that has been in the analytics business for many years, and it is an industry innovator in this area. As system hardware and data management and analytics software have evolved to offer improved price/performance, SAS has successfully grown its product set to take advantage of these performance improvements.

SAS has several projects that are designed to exploit modern hardware and database management technologies. Two important ones are in-database analytics and in-memory analytics. These two technologies are not mutually exclusive, but instead are designed to work together.

With SAS In-Database analytics, the objective is to improve performance by moving the processing to the data, rather than the data to the processing. Performance is enhanced by reducing data traffic between systems and enabling the analytic processing to leverage the parallel processing capabilities of a database system. SAS has partnerships with companies such as IBM, Pivotal (formerly Greenplum) and Teradata Aster to provide this facility.

SAS In-Memory analytics is designed to take advantage of the industry trend toward exploiting a broader spectrum of data sources, lower main memory costs, and increasing server main memory sizes. Processing data in memory can dramatically improve the performance of complex analytic workloads. It is this capability that underlies Hadoop support by SAS.

Hadoop Solution: SAS LASR Analytic Server. The SAS LASR Analytic Server is a high-performance in-memory analytic engine designed to address a variety of different analytic use cases. The product can run on its own standalone Red Hat Linux SMP server or be deployed on a massively parallel hardware cluster running Hadoop, Pivotal or Teradata software, for example. In a Hadoop hardware cluster environment, the SAS LASR Analytic Server runs on each node of the cluster. It can directly access and persist data in HDFS, i.e., it does not use Hadoop Hive or MapReduce. To aid performance, it provides its own HDFS file format for persisting data, which enables the fast loading of HDFS data into the memory of the LASR server for analysis. The SAS LASR Analytic Server supports clusters running either an Apache Hadoop or Cloudera CDH distribution.

SAS's direction is toward utilizing the SAS LASR Analytic Server for several of its in-memory analytic tools and applications. The first product to support this environment is the SAS Visual Analytics product, which allows users to visually explore and analyze data, and share the results using web and mobile interfaces. SAS High-Performance Analytics products also leverage the SAS LASR Analytic Server for in-memory analytic model development using data stored in Hadoop. SAS High-Performance Analytics products and SAS Visual Analytics can access and share the same data stored in the LASR Analytic Server.

SAS has also enhanced its SAS/ACCESS interface to support access to Hadoop data using Hive. This will, for example, allow SAS data integration and data quality management products to access Hadoop data, and support data transformations written in Pig, Hive and MapReduce. SAS/ACCESS can also be used to access non-Hive data such as delimited, XML and custom file formats.

SAS Positioning: One of the key benefits of technologies such as the SAS LASR Analytic Server is the ability to dramatically improve the performance for analyzing all forms of data, be they structured or multi-structured. SAS is a leader in line-of-business analytic applications in areas such as fraud detection, risk management and anti-money laundering. The ability to run analytics and models for those applications against larger volumes and varieties of data with faster performance provides SAS with an important competitive advantage.

The SAS LASR Analytic Server will inevitably be compared with SAP's HANA in-memory appliance. The products, however, are different. SAP HANA is an in-memory database engine with built-in analytic processing. All the data must be managed in memory. The SAS LASR Analytic Server, on the other hand, is an in-memory analytic engine, not a database engine. It is designed for fast analytic operations such as clustering, decision trees, regressions and correlations.
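The in-database principle mentioned for SAS above (moving the processing to the data rather than the data to the processing) can be illustrated generically. The sketch below is an assumption-laden toy, not SAS code: it uses Python's standard `sqlite3` module as a stand-in database, and the `sales` table and its contents are invented for illustration. It computes the same totals twice, once by pulling every row to the client and once by pushing the aggregation into the database, and compares how many rows cross the boundary.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a parallel database system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0)],
)

# Data to the processing: fetch every row, then aggregate in the client.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Processing to the data: the database aggregates and returns one row per group.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
db_totals = dict(pushed)

print("rows transferred (client-side aggregation):", len(rows))
print("rows transferred (in-database aggregation):", len(pushed))
print("results match:", client_totals == db_totals)
```

With only four rows the saving is trivial, but the number of rows transferred by the second approach grows with the number of groups rather than with the size of the table, which is the source of the reduced data traffic the text describes.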

ABOUT THE AUTHOR

ABOUT TECHTARGET: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.

COLIN WHITE is the founder and president of BI Research and president of DataBase Associates Inc. As an analyst, educator and writer, he is well-known for his in-depth knowledge of database management, data integration, data warehousing, and business intelligence technologies. With many years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. White has written numerous articles and papers on deploying new and evolving information technologies for a wide variety of print and web-based journals. For 10 years he was the conference director of the DB/EXPO trade show and conference. He was also the conference chair for many years of the Shared Insights Portals, Content Management, and Collaboration Conference. Email him at [email protected]

Hadoop Data Management Platforms: Market Segmentation and Product Positioning is a BI Leadership e-publication.

Wayne Eckerson

Director, BI Leadership

Jean Schauer, Editor in Chief

Annie Mathews, Director of Sales

[email protected]

TechTarget, 275 Grove Street, Newton, MA 02466, www.techtarget.com

© 2013 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.