Hadoop Innovation Summit 2014

THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS

Subash D’Souza

WHO AM I?

Recognized as a Champion of Big Data by Cloudera
Co-Organizer - Los Angeles Hadoop User Group
Organizer - Los Angeles HBase User Group
Organizer - Los Angeles Big Data Users Group
Organizer - Big Data Camp LA
Speaker - Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author - HBase Developer’s Cookbook (Out Fall 2014)
Technical Reviewer - Apache Flume: Distributed Log Collection for Hadoop

HADOOP: OLD & NEW

Hadoop was first released in 2006, based on the GFS and MapReduce papers published by Google
Adoption since then has been massive and rapid
Companies such as Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration are adopting Hadoop

Hadoop 2.0, which introduced YARN, went GA in October 2013
It is backward compatible with the Hadoop 1.0 APIs
It replaced the JobTracker and TaskTrackers with the ResourceManager, ApplicationMasters and NodeManagers

A BRIEF HISTORY

Timeline, 2002 to 2014:
Doug Cutting launches Nutch project
Google releases GFS paper
Google releases MapReduce paper
MapReduce implemented in Nutch
Nutch adds distributed file system
Hadoop spun out of Nutch project at Yahoo
Hadoop breaks Terasort world record
Cloudera founded
MapR founded
Hortonworks founded
Hadoop 2.0 w/HA available
Impala (SQL on Hadoop) launched
YARN goes GA
Stinger/Tez to be released
HBase, Zookeeper, Flume and more added to CDH

PREVIOUSLY, THE STATE OF DATA

As a data analyst, you previously could not ask the questions you wanted to ask because the data points were not available
As a corollary, you could not think of questions to ask of your data because you did not know you had access to those data points

BIG DATA IMPACT

FOCUS

There is no standard way to get to the data
This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools for pulling the data is huge and ever-expanding

As a company, what do you choose?
What do you focus on?
Question – Do you replace your current data infrastructure or do you augment it?

HADOOP TECHNOLOGIES

DISTRIBUTIONS OF HADOOP

Apache
Hortonworks
Cloudera
MapR
Intel
IBM
Pivotal

HORTONWORKS HDP 2.0

Source: hortonworks.com

CLOUDERA ENTERPRISE DATA HUB

Source: cloudera.com & techweekly.com

MAPR M7 ENTERPRISE

Source: business-software.com & wn.com

INTEL DISTRIBUTION FOR APACHE HADOOP

Source: gigaom.com

IBM BIGINSIGHTS ENTERPRISE EDITION

Source: ndm.net

PIVOTAL HD

Source: infoq.com

CHOICES

Hortonworks – Completely open source. Everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.

Cloudera – Offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support that includes the additional tools.

MapR – Offers a version of Hadoop that replaces HDFS with the proprietary MFS (MapR File System). Everything else on their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.

ADVANTAGES OF YARN

Ability to serve multiple tenants, i.e. running multiple applications atop the same framework (multi-tenancy)

Splits the work of the JobTracker between the ResourceManager and per-application ApplicationMasters, so that a single daemon no longer has to both allocate resources and manage tasks

Ability to restart jobs from the place where they failed
Scales well beyond the limitations of MR1 (4000 nodes)
Backward compatibility with Hadoop 1.0
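
The split of responsibilities described on this slide can be sketched with a toy model. This is not the YARN API: the class names only echo the YARN roles, and the methods (`allocate`, `release`, `request_containers`, `finish_running`, `run_cluster`) are hypothetical names chosen for illustration. The sketch assumes every task takes one round and always succeeds.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Cluster-wide role: only tracks free capacity and hands out containers."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Per-application role: tracks its own tasks, never the cluster capacity."""
    name: str
    pending_tasks: list
    running_tasks: list = field(default_factory=list)
    finished_tasks: list = field(default_factory=list)

    def request_containers(self, rm: ResourceManager) -> None:
        # Ask for one container per pending task and start what was granted.
        granted = rm.allocate(len(self.pending_tasks))
        self.running_tasks = self.pending_tasks[:granted]
        self.pending_tasks = self.pending_tasks[granted:]

    def finish_running(self, rm: ResourceManager) -> None:
        # Toy assumption: every running task succeeds within the round.
        self.finished_tasks.extend(self.running_tasks)
        rm.release(len(self.running_tasks))
        self.running_tasks = []


def run_cluster(rm: ResourceManager, apps: list) -> int:
    """Run rounds until all applications are done; return the round count."""
    if rm.free_containers <= 0:
        raise ValueError("the cluster needs at least one container")
    rounds = 0
    while any(app.pending_tasks for app in apps):
        for app in apps:
            app.request_containers(rm)
        for app in apps:
            app.finish_running(rm)
        rounds += 1
    return rounds
```

With 4 containers and two applications holding 5 and 2 tasks, `run_cluster` finishes in 2 rounds: the first application takes all 4 containers in round one, and the leftover task shares the cluster with the second application in round two. Neither application needs to know about the other, which is the multi-tenancy point made above.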

SQL-ON-HADOOP

The different SQL-on-Hadoop tools currently available:
Hive
Impala
Drill
Stinger/Tez
HAWQ
Hadapt
Presto
Shark

SQL-ON-HADOOP BENCHMARK - SCAN

Source: amplab.cs.berkeley.edu/benchmark

SQL-ON-HADOOP BENCHMARK - AGGREGATE


SQL-ON-HADOOP BENCHMARK - JOIN


SQL ON HADOOP VS. TRADITIONAL RDBMS

Data on Hadoop is not as responsive as an RDBMS
Data in Hadoop can scale much better than in an RDBMS

Data in Hadoop can be accessed through a variety of mechanisms such as Hive, Impala, Drill, etc., i.e. the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where you would need to move data from one system to another; for example, Oracle cannot pull from SQL Server and vice versa
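
The separation of query engine and storage can be illustrated with a small sketch. It is only an analogy: a local CSV file stands in for data in HDFS, and the two functions are hypothetical stand-ins for different engines, not Hive or Impala code. Both read the same stored file and must agree on the answer.

```python
import csv
from collections import defaultdict
from pathlib import Path


def write_shared_data(directory: Path) -> Path:
    """Stand-in for files in HDFS: one plain CSV that any engine can read."""
    path = directory / "page_views.csv"
    rows = [("home", 3), ("search", 5), ("home", 4)]  # made-up sample data
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["page", "views"])
        writer.writerows(rows)
    return path


def batch_engine_total(path: Path) -> dict:
    """Hypothetical 'batch' engine: loads every row first, then groups."""
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    totals = defaultdict(int)
    for row in rows:
        totals[row["page"]] += int(row["views"])
    return dict(totals)


def streaming_engine_total(path: Path) -> dict:
    """Hypothetical 'interactive' engine: aggregates while scanning."""
    totals = {}
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            page = row["page"]
            totals[page] = totals.get(page, 0) + int(row["views"])
    return totals
```

Because neither function owns the file format, a third engine could be added without copying or converting the data, which is the property the slide contrasts with moving data between two RDBMS products.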

QUESTION?

Do we augment or replace our current data infrastructure?
Answer – Augment
Why? – Combine the best of both worlds: keep aggregated data in your existing data stores, and keep all the detail data and its full history in Hadoop
Of course, you will have different SLAs depending on the query you ask.
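
One way to picture the augment approach is a small router that decides where a query runs. Everything here is a hypothetical placeholder: the backend names, the table names and the latency figures are invented for illustration and are not measurements or recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    expected_latency_seconds: float  # illustrative SLA target, not a measurement


# Placeholder backends: an existing warehouse holding aggregates,
# and Hadoop holding the full detail data.
WAREHOUSE = Backend("warehouse_aggregates", expected_latency_seconds=1.0)
HADOOP = Backend("hadoop_detail", expected_latency_seconds=60.0)

# Hypothetical tables that already exist in aggregated form.
AGGREGATED_TABLES = {"daily_sales", "monthly_active_users"}


def route_query(table: str, needs_row_level_detail: bool) -> Backend:
    """Use the warehouse only when pre-aggregated data can answer the query."""
    if table in AGGREGATED_TABLES and not needs_row_level_detail:
        return WAREHOUSE
    return HADOOP
```

A dashboard query on `daily_sales` would go to the warehouse with the tighter SLA, while a drill-down into individual records on the same table would fall through to Hadoop with the looser one.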

CHALLENGES

Data Protection
Security
SLAs – Service Level Agreements
Integration w/ applications
Services and support
Training
Performance
Scaling and Administration

STARTUPS VS. MATURE

Startups that work with data should consider going with YARN to gain its advantages
Mature companies tend to be conservative and hence will look to the more established use cases of MR1
Both startups and mature companies should look at the advantages of YARN as well as at applying more near real-time SQL-on-Hadoop

GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES

Getting started with Hadoop – Opportunity to hit the ground running with YARN plus bleeding-edge technologies.
Established companies with a Hadoop practice tend to be conservative, but that shouldn’t prevent them from coming up with a migration plan to YARN

REAL TIME ANALYTICS

Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
HAWQ
IBM BigSQL

REAL TIME STREAMING

Flume
Kafka
Scribe
HBase

SECURITY

Kerberos with ACLs
Cloudera Sentry
Project Knox
Accumulo (BigTable clone)
HBase w/Cell Security

DEVELOPERS TOOLSET

Cloudera CDK (renamed to Kite)
Java M/R
Spring for Hadoop
Hive
Pig
Scalding
Impala
Others

MANAGEMENT, GUI, MACHINE LEARNING, MONITORING, SCHEDULING & GRAPH DB

Ambari
Cloudera Manager
HUE
Mahout
Giraph
Zookeeper
Oozie

FUTURE OF HADOOP: YARN & NEAR REAL TIME SQL-ON-HADOOP

Multi-Tenancy
HA (High Availability)
Tools for SQL-on-Hadoop:
Impala
Stinger/Tez
Drill
Shark

WHAT DO YOU CHOOSE?

The choices are huge
The toolsets are varied
First focus on the problems you are trying to solve. Don’t choose Hadoop because it is the latest buzzword. Make sure there is a real problem to solve

Focus on developers and administrators, and ensure that for whatever toolset you choose, they have the relevant skillset, training will be provided, or relevant resources will be brought in from outside (whether through hiring or consulting)

REMEMBER THE PROBLEM SET!!! i.e. what you are trying to solve

CAVEATS

Work is still being done on bringing real-time SQL-on-Hadoop to YARN
Impala has Llama for this
The Stinger for Hive preview is currently available
HBase on YARN (HOYA) is also actively being worked on
Since YARN is a low-level API, some abstraction is needed, which is available with tools such as Samza and Weave

BIG DATA = BIG IMPACT

Ken Rudin, Director of Analytics, Facebook
“You need to go the last mile and evangelize your insights so that people actually act on them and there is impact.”
“It doesn’t matter how brilliant our analyses are. If nothing changes we have made no impact.”

GIVING BACK

Hadoop is an open source project
The work on Hadoop and its ecosystem tools is done by committers and contributors, some of whom do it in their own personal time, reporting and fixing bugs as well as adding new functionality.

Please give back, either by becoming a contributor (testing, filing bugs) or by sharing your use case for Hadoop (at meetups and/or conferences such as this one), so that others can learn from the issues you have faced as well as see the rapid adoption of the Hadoop ecosystem toolset.

THANKS

Subash D’Souza
Twitter: @sawjd22
LinkedIn: www.linkedin.com/in/sawjd/
Email: [email protected]
