THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS Subash D’Souza Hadoop Innovation Summit 2014



DESCRIPTION

Slides from my presentation at Hadoop Innovation Summit 2014 The Future of Hadoop: Choosing the right options


Page 1: Hadoop Innovation Summit 2014

THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS

Subash D’Souza
Hadoop Innovation Summit 2014

Page 2: Hadoop Innovation Summit 2014

WHO AM I?

Recognized as a Champion of Big Data by Cloudera
Co-Organizer - Los Angeles Hadoop User Group
Organizer - Los Angeles HBase User Group
Organizer - Los Angeles Big Data Users Group
Organizer - Big Data Camp LA
Speaker - Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author - HBase Developer’s Cookbook (Out Fall 2014)
Technical Reviewer - Apache Flume: Distributed Log Collection for Hadoop

Page 3: Hadoop Innovation Summit 2014

HADOOP: OLD & NEW

Hadoop was first released in 2006.
Based on the GFS and MapReduce papers released by Google.
Adoption has been massive and rapid ever since.
Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration are adopting Hadoop.

Hadoop 2.0, AKA YARN, went GA in October 2013.
It is backwards compatible with the Hadoop 1.0 APIs.
It replaced the JobTracker and TaskTrackers with the Application Master, Resource Manager and Node Managers.

Page 4: Hadoop Innovation Summit 2014

A BRIEF HISTORY

[Timeline graphic, 2002 to 2014. Events shown:]
Doug Cutting launches Nutch project
Google releases GFS paper
Google releases MapReduce paper
MapReduce implemented in Nutch
Nutch adds distributed file system
Hadoop spun out of Nutch project at Yahoo
Hadoop breaks Terasort world record
Cloudera founded
MapR founded
Hortonworks founded
Hadoop 2.0 w/HA available
Impala (SQL on Hadoop) launched
YARN goes GA
Stinger/Tez to be released
HBase, Zookeeper, Flume and more added to CDH

Page 5: Hadoop Innovation Summit 2014

PREVIOUSLY, THE STATE OF DATA

As a data analyst, you previously could not ask the questions you wanted to ask because the data points were not available.

Corollary: you could not think of questions to ask of your data because you did not know you had access to those data points.

Page 6: Hadoop Innovation Summit 2014

BIG DATA IMPACT

Page 7: Hadoop Innovation Summit 2014

FOCUS

There is no standard way to get to the data. This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools for pulling the data is huge and still expanding.

As a company, what do you choose?
What do you focus on?
Question: do you replace your current data infrastructure or do you augment it?

Page 8: Hadoop Innovation Summit 2014

HADOOP TECHNOLOGIES

Page 9: Hadoop Innovation Summit 2014

DISTRIBUTIONS OF HADOOP

Apache
Hortonworks
Cloudera
MapR
Intel
IBM
Pivotal

Page 10: Hadoop Innovation Summit 2014

HORTONWORKS HDP 2.0

Source: hortonworks.com

Page 11: Hadoop Innovation Summit 2014

CLOUDERA ENTERPRISE DATA HUB

Source: cloudera.com & techweekly.com

Page 12: Hadoop Innovation Summit 2014

MAPR M7 ENTERPRISE

Source: business-software.com & wn.com

Page 13: Hadoop Innovation Summit 2014

INTEL DISTRIBUTION FOR APACHE HADOOP

Source: gigaom.com

Page 14: Hadoop Innovation Summit 2014

IBM BIGINSIGHTS ENTERPRISE EDITION

Source: ndm.net

Page 15: Hadoop Innovation Summit 2014

PIVOTAL HD

Source: infoq.com

Page 16: Hadoop Innovation Summit 2014

CHOICES

Hortonworks: completely open source. Everything on their platform is available from the Apache Hadoop distribution. Available as a free download or with paid support.

Cloudera: offers the open source Apache Hadoop distribution as well as management tools built for the Cloudera distribution. Available as a free download, or with paid support that includes the additional tools.

MapR: offers a version of Hadoop that replaces HDFS with the proprietary MFS (MapR File System). Everything else on their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.

Page 17: Hadoop Innovation Summit 2014

ADVANTAGES OF YARN

Multi-tenancy: the ability to run multiple applications from different clients atop the same framework

Splits the work of the JobTracker between the Resource Manager and per-application Application Masters, so that no single daemon has to both allocate resources and manage tasks

Ability to restart jobs from the place where they failed
Scales well beyond the limitations of MR1 (about 4,000 nodes)
Backward compatibility with Hadoop 1.0
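
The split of responsibilities described on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and method names below borrow YARN's vocabulary but are hypothetical and are not the YARN API. The allocator hands out container slots and knows nothing about tasks, while the per-application coordinator tracks which of its tasks have finished, so a rerun picks up only the tasks that failed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NodeManager:
    """One worker node and the number of free container slots it has."""
    name: str
    free_slots: int


@dataclass
class ResourceManager:
    """Cluster-wide allocator: grants containers, knows nothing about tasks."""
    nodes: list

    def allocate(self, wanted: int) -> list:
        # Grant up to `wanted` containers, filling nodes in order.
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < wanted:
                node.free_slots -= 1
                granted.append(node.name)
        return granted

    def release(self, granted: list) -> None:
        # Return containers to the nodes they came from.
        by_name = {node.name: node for node in self.nodes}
        for name in granted:
            by_name[name].free_slots += 1


@dataclass
class ApplicationMaster:
    """Per-application coordinator: tracks its own tasks and their completion."""
    app_id: str
    tasks: list
    done: set = field(default_factory=set)

    def run(self, rm: ResourceManager, execute: Callable[[str, str], bool]) -> list:
        # Only tasks that have not succeeded yet are (re)submitted.
        pending = [t for t in self.tasks if t not in self.done]
        containers = rm.allocate(len(pending))
        for task, node in zip(pending, containers):
            if execute(task, node):
                self.done.add(task)
        rm.release(containers)
        # Whatever is still unfinished can be retried by calling run() again.
        return [t for t in self.tasks if t not in self.done]
```

Calling `run` a second time after a partial failure resubmits only the unfinished tasks, which is the "restart from where it failed" behaviour in miniature. Real YARN adds scheduling queues, heartbeats and container lifecycles that this sketch leaves out.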

Page 18: Hadoop Innovation Summit 2014

SQL-ON-HADOOP

The different SQL-on-Hadoop tools currently available:
Hive
Impala
Drill
Stinger/Tez
HAWQ
Hadapt
Presto
Shark

Page 19: Hadoop Innovation Summit 2014

SQL-ON-HADOOP BENCHMARK - SCAN

Source: amplab.cs.berkeley.edu/benchmark

Page 20: Hadoop Innovation Summit 2014

SQL-ON-HADOOP BENCHMARK - AGGREGATE

Source: amplab.cs.berkeley.edu/benchmark

Page 21: Hadoop Innovation Summit 2014

SQL-ON-HADOOP BENCHMARK - JOIN

Source: amplab.cs.berkeley.edu/benchmark

Page 22: Hadoop Innovation Summit 2014

SQL ON HADOOP VS. TRADITIONAL RDBMS

Data on Hadoop is not as responsive as data in an RDBMS.
Data in Hadoop can scale much better than data in an RDBMS.

Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc.; that is, the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where you would need to move data from one system to another; for example, Oracle cannot pull from SQL Server and vice versa.
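
The idea that query engines are decoupled from the storage layer can be shown with a small sketch. Everything here is an assumption made for illustration: the `storage` dictionary stands in for files on HDFS, and the two engine classes are hypothetical stand-ins that do not reflect how Hive, Impala or Drill work internally. The point is only that two different execution strategies can read the same stored data and return the same answer.

```python
from collections import defaultdict


class BatchEngine:
    """Stand-in for a batch engine: a map phase followed by a reduce phase."""

    def sum_by(self, storage: dict, table: str, key: str, value: str) -> dict:
        # Map: emit (key, value) pairs for every stored row.
        mapped = [(row[key], row[value]) for row in storage[table]]
        # Shuffle: group values by key.
        grouped = defaultdict(list)
        for k, v in mapped:
            grouped[k].append(v)
        # Reduce: sum each group.
        return {k: sum(vs) for k, vs in grouped.items()}


class StreamingEngine:
    """Stand-in for an interactive engine: one pass with running totals."""

    def sum_by(self, storage: dict, table: str, key: str, value: str) -> dict:
        totals = defaultdict(int)
        for row in storage[table]:
            totals[row[key]] += row[value]
        return dict(totals)


# Hypothetical sample data; the table and column names are invented.
storage = {
    "clicks": [
        {"user": "a", "n": 2},
        {"user": "b", "n": 1},
        {"user": "a", "n": 3},
    ]
}

batch_result = BatchEngine().sum_by(storage, "clicks", "user", "n")
stream_result = StreamingEngine().sum_by(storage, "clicks", "user", "n")
```

Because neither engine owns the data, a new engine can be added later without reloading or converting what is already stored.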

Page 23: Hadoop Innovation Summit 2014

QUESTION?

Do we augment or replace our current data infrastructure?
Answer: augment.
Why? Combine the best of both worlds: keep aggregated data in your existing data stores and all the detail and lifetime data in Hadoop.
Of course, you will have different SLAs depending on the query you ask.
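
One way to picture the "augment" answer is a small routing rule in front of the two systems. This is a minimal sketch, assuming an invented table name and made-up SLA targets; none of these values come from the talk. Queries that only need pre-aggregated data go to the existing warehouse, and anything needing row-level detail or full history goes to Hadoop with a looser SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    table: str
    needs_detail: bool  # True when row-level or full-history data is required


# Hypothetical values, chosen only for illustration.
AGGREGATED_TABLES = {"daily_sales"}
SLA_SECONDS = {"warehouse": 5, "hadoop": 600}


def route(query: Query) -> str:
    """Pick the system that should answer the query."""
    if not query.needs_detail and query.table in AGGREGATED_TABLES:
        return "warehouse"
    return "hadoop"


def sla_for(query: Query) -> int:
    """Return the response-time target, in seconds, for the chosen system."""
    return SLA_SECONDS[route(query)]
```

The rule is deliberately simple. In practice the decision would also weigh data freshness, cost and concurrency, which are left out here.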

Page 24: Hadoop Innovation Summit 2014

CHALLENGES

Data Protection
Security
SLAs (Service Level Agreements)
Integration w/ applications
Services and support
Training
Performance
Scaling and Administration

Page 25: Hadoop Innovation Summit 2014

STARTUPS VS. MATURE

Startups that work with data should consider going with YARN to gain its advantages.
Mature companies tend to be conservative and hence will look to the more established use cases of MR1.
Both startups and mature companies should look at the advantages of YARN, as well as at applying more near-real-time SQL-on-Hadoop.

Page 26: Hadoop Innovation Summit 2014

GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES

Getting started with Hadoop: an opportunity to hit the ground running with YARN plus bleeding-edge technologies.
Established companies with a Hadoop practice tend to be conservative, but that shouldn’t prevent them from coming up with a migration plan to YARN.

Page 27: Hadoop Innovation Summit 2014

REAL TIME ANALYTICS

Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
HAWQ
IBM BigSQL

Page 28: Hadoop Innovation Summit 2014

REAL TIME STREAMING

Flume
Kafka
Scribe
HBase

Page 29: Hadoop Innovation Summit 2014

SECURITY

Kerberos with ACLs
Cloudera Sentry
Project Knox
Accumulo (BigTable clone)
HBase w/ Cell Security

Page 30: Hadoop Innovation Summit 2014

DEVELOPERS TOOLSET

Cloudera CDK (renamed to Kite)
Java M/R
Spring for Hadoop
Hive
Pig
Scalding
Impala
Others

Page 31: Hadoop Innovation Summit 2014

MANAGEMENT, GUI, MACHINE LEARNING, MONITORING, SCHEDULING & GRAPH DB

Ambari
Cloudera Manager
HUE
Mahout
Giraph
Zookeeper
Oozie

Page 32: Hadoop Innovation Summit 2014

FUTURE OF HADOOP: YARN & NEAR REAL TIME SQL-ON-HADOOP

Multi-Tenancy
HA (High Availability)
Tools for SQL-on-Hadoop: Impala, Stinger/Tez, Drill, Shark

Page 33: Hadoop Innovation Summit 2014

WHAT DO YOU CHOOSE?

The choices are huge.
The toolsets are varied.
First focus on the problems you are trying to solve. Don’t choose Hadoop because it is the latest buzzword. Make sure there is a real need to address.

Focus on developers and administrators: whatever toolset you choose, ensure that they have the relevant skill set, that training will be provided, or that relevant resources will be brought in from outside (whether through hiring or consulting).

REMEMBER THE PROBLEM SET!!! That is, what you are trying to solve.

Page 34: Hadoop Innovation Summit 2014

CAVEATS

Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
Impala has Llama for this.
A preview of Stinger for Hive is currently available.
HBase on YARN (HOYA) is also actively being worked on.
Since YARN is a low-level API, some abstraction is needed, which is available with tools such as Samza and Weave.

Page 35: Hadoop Innovation Summit 2014

BIG DATA = BIG IMPACT

Ken Rudin, Director of Analytics, Facebook:
“You need to go the last mile and evangelize your insights so that people actually act on them and there is impact.”
“It doesn’t matter how brilliant our analyses are. If nothing changes we have made no impact.”

Page 36: Hadoop Innovation Summit 2014

GIVING BACK

Hadoop is an open source project.
The work on it and on the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as building new functionality.

Please give back, either by becoming a contributor (testing, filing bugs) or by sharing your use case for Hadoop (at meetups and/or conferences such as this one), so that others can learn from the issues you have faced and see the rapid adoption of the Hadoop ecosystem toolset.

Page 37: Hadoop Innovation Summit 2014

THANKS

Subash D’Souza
Twitter: @sawjd22
LinkedIn: www.linkedin.com/in/sawjd/
Email: [email protected]