Slides from my presentation at Hadoop Innovation Summit 2014: The Future of Hadoop: Choosing the Right Options
THE FUTURE OF HADOOP: CHOOSING THE RIGHT OPTIONS
Subash D’Souza
Hadoop Innovation Summit 2014
WHO AM I?
Recognized as a Champion of Big Data by Cloudera
Co-Organizer - Los Angeles Hadoop User Group
Organizer - Los Angeles HBase User Group
Organizer - Los Angeles Big Data Users Group
Organizer - Big Data Camp LA
Speaker - Big Data Camp LA 2013
Leading a BOF Session at Hadoop Summit Europe 2014
Author - HBase Developer’s Cookbook (Out Fall 2014)
Technical Reviewer - Apache Flume: Distributed Log Collection for Hadoop
HADOOP: OLD & NEW
Hadoop was first released in 2006
Based on the GFS and MapReduce papers released by Google
Ever since, adoption has been massive and rapid
Companies like Facebook, Netflix, eBay, Yahoo, Expedia, Spotify and even the Social Security Administration are adopting Hadoop
Hadoop 2.0, AKA YARN, went GA in October 2013 (release 2.2.0)
Is backwards compatible with the Hadoop 1.0 APIs
Replaced the JobTracker and TaskTrackers with the Application Master, Resource Manager and Node Managers
A BRIEF HISTORY
Timeline, 2002 to 2014:
Doug Cutting launches Nutch project
Google releases GFS paper
Google releases MapReduce paper
MapReduce implemented in Nutch
Nutch adds distributed file system
Hadoop spun out of Nutch project at Yahoo
Hadoop breaks Terasort world record
Cloudera founded
MapR founded
Hortonworks founded
Hadoop 2.0 w/HA available
Impala (SQL on Hadoop) launched
YARN goes GA
Stinger/Tez to be released
HBase, Zookeeper, Flume and more added to CDH
PREVIOUSLY, THE STATE OF DATA
As a data analyst, you previously could not ask the questions you wanted to ask because the data points were not available.
As a corollary, you could not think of questions to ask of your data because you did not know you had access to those data points.
BIG DATA IMPACT
FOCUS
There is no standard way to get to the data. This is both a plus and a minus: a plus because there is variety to choose from, a minus because the number of tools for pulling the data is huge and ever-expanding.
As a company, what do you choose?
What do you focus on?
Question - Do you replace your current data infrastructure or do you augment it?
HADOOP TECHNOLOGIES
DISTRIBUTIONS OF HADOOP
Apache
Hortonworks
Cloudera
MapR
Intel
IBM
Pivotal
HORTONWORKS HDP 2.0
Source: hortonworks.com
CLOUDERA ENTERPRISE DATA HUB
Source: cloudera.com & techweekly.com
MAPR M7 ENTERPRISE
Source: business-software.com & wn.com
INTEL DISTRIBUTION FOR APACHE HADOOP
Source: gigaom.com
IBM BIGINSIGHTS ENTERPRISE EDITION
Source: ndm.net
PIVOTAL HD
Source: infoq.com
CHOICES
Hortonworks – Completely Open Source – Everything on their platform is available from Apache Hadoop Distribution. Available as a free download or with paid support.
Cloudera – Offers the open source Apache Hadoop Distribution as well as management tools built for the Cloudera Distribution. Available as a free download, or with paid support that includes the additional tools.
MapR – Offers a version of Hadoop that replaces HDFS with the proprietary MapR File System (MFS). Everything else on their stack is based on the open source Apache distribution. Offers a free M3 version along with paid M5 and M7 versions.
ADVANTAGES OF YARN
Ability to handle multiple tenants, i.e. running multiple applications atop the same framework (multi-tenancy)
Splits the work of the JobTracker between the Resource Manager and the Application Master, so that no single daemon has to both allocate resources and manage the tasks
Ability to restart jobs from the point where they failed
Scales well beyond the limitations of MR1 (4000 nodes)
Backward support for Hadoop 1.0
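The split of responsibilities described above can be illustrated with a toy model. This is only a sketch under loud assumptions: it is not the YARN API, the class names merely mirror the YARN component names, and every method, task name and container count is hypothetical. It shows one resource manager that only hands out containers, while each application's own master tracks its tasks, so two different applications can share one cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Cluster-wide role: hands out containers, knows nothing about tasks."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Per-application role: tracks its own tasks, asks the RM for containers."""
    name: str
    pending_tasks: list
    completed: list = field(default_factory=list)

    def run_tasks(self, containers: int) -> None:
        # Run one pending task per granted container (hypothetical model).
        running = self.pending_tasks[:containers]
        self.pending_tasks = self.pending_tasks[containers:]
        self.completed.extend(running)


def run_cluster(total_containers: int, apps: list) -> int:
    """Schedule all applications on one shared cluster; return rounds used."""
    if total_containers < 1:
        raise ValueError("the cluster needs at least one container")
    rm = ResourceManager(free_containers=total_containers)
    rounds = 0
    while any(app.pending_tasks for app in apps):
        # Every application asks the shared resource manager first ...
        grants = [(app, rm.allocate(len(app.pending_tasks))) for app in apps]
        # ... then each application master runs its own tasks and gives
        # the containers back.
        for app, granted in grants:
            app.run_tasks(granted)
            rm.release(granted)
        rounds += 1
    return rounds


# Two tenants of different kinds sharing four containers (made-up workloads).
etl = ApplicationMaster("mapreduce-etl", ["map-1", "map-2", "reduce-1"])
sql = ApplicationMaster("sql-query", ["scan-1", "scan-2", "join-1"])
rounds = run_cluster(total_containers=4, apps=[etl, sql])
print(rounds, etl.completed, sql.completed)
```

In this model the resource manager never inspects a task, and neither application master knows about the other. That separation is what the multi-tenancy point above refers to.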
SQL-ON-HADOOP
The different SQL-on-Hadoop tools currently available:
Hive
Impala
Drill
Stinger/Tez
HAWQ
Hadapt
Presto
Shark
SQL-ON-HADOOP BENCHMARK - SCAN
Source: amplab.cs.berkeley.edu/benchmark
SQL-ON-HADOOP BENCHMARK - AGGREGATE
Source: amplab.cs.berkeley.edu/benchmark
SQL-ON-HADOOP BENCHMARK - JOIN
Source: amplab.cs.berkeley.edu/benchmark
SQL ON HADOOP VS. TRADITIONAL RDBMS
Data on Hadoop is not as responsive as data in an RDBMS
Data in Hadoop can scale much better than data in an RDBMS
Data in Hadoop can be accessed using a variety of mechanisms such as Hive, Impala, Drill, etc., i.e. the query engines are abstracted from the Hadoop (HDFS) storage layer. The same cannot be said of an RDBMS, where the data is tied to one system: for example, Oracle cannot pull from SQL Server and vice versa
QUESTION?
Do we augment or replace our current data infrastructure?
Answer - Augment
Why? - Combine the best of both worlds: use aggregated data in your data stores, and keep all the detail and lifetime data in Hadoop
Of course, you will have different SLAs depending on the query you ask.
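The augment pattern above can be sketched in a few lines. This is an illustration only, with stand-ins throughout: an in-memory SQLite database plays the role of the existing data store, a plain Python list plays the role of detail data kept in Hadoop, and the table name, function names and sample records are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# Stand-in for the detail data kept in Hadoop: (day, user, amount).
DETAIL_EVENTS = [
    ("2014-01-01", "alice", 10.0),
    ("2014-01-01", "bob", 5.0),
    ("2014-01-02", "alice", 7.5),
]


def build_daily_totals(events):
    """Batch step: aggregate the detail data into one row per day."""
    totals = defaultdict(float)
    for day, _user, amount in events:
        totals[day] += amount
    return sorted(totals.items())


def load_warehouse(daily_totals):
    """Load the aggregates into the existing data store (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO daily_totals VALUES (?, ?)", daily_totals)
    conn.commit()
    return conn


def total_for_day(conn, day):
    """Summary question: answered from the small aggregate table."""
    row = conn.execute(
        "SELECT total FROM daily_totals WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else None


def detail_for_user(events, user):
    """Detail question: needs the full history, so it scans the detail data."""
    return [event for event in events if event[1] == user]


warehouse = load_warehouse(build_daily_totals(DETAIL_EVENTS))
print(total_for_day(warehouse, "2014-01-01"))
print(detail_for_user(DETAIL_EVENTS, "alice"))
```

The two query paths would carry different SLAs in practice: the summary lookup touches one aggregate row, while the detail question scans every record.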
CHALLENGES
Data Protection
Security
SLAs (Service Level Agreements)
Integration w/ applications
Services and support
Training
Performance
Scaling and Administration
STARTUPS VS. MATURE
Startups that work with data should consider going with YARN to gain its advantages
Mature companies tend to be conservative and hence will look to the more established use cases of MR1
Both startups and mature companies should look at the advantages of YARN as well as at applying more near-real-time SQL-on-Hadoop
GETTING STARTED WITH HADOOP VS. ESTABLISHED HADOOP PRACTICES
Getting started with Hadoop - An opportunity to hit the ground running with YARN plus bleeding-edge technologies.
Established companies with a Hadoop practice tend to be conservative, but that shouldn’t prevent them from coming up with a migration plan to YARN.
REAL TIME ANALYTICS
Kiji
HBase
Storm
Shark
Redshift
Impala
Stinger
Drill
Accumulo
Presto
HAWQ
IBM BigSQL
REAL TIME STREAMING
Flume
Kafka
Scribe
HBase
SECURITY
Kerberos with ACLs
Cloudera Sentry
Project Knox
Accumulo (BigTable clone)
HBase w/ Cell Security
DEVELOPERS TOOLSET
Cloudera CDK (renamed to Kite)
Java M/R
Spring for Hadoop
Hive
Pig
Scalding
Impala
Others
MANAGEMENT, GUI, MACHINE LEARNING, MONITORING, SCHEDULING & GRAPH DB
Ambari
Cloudera Manager
HUE
Mahout
Giraph
Zookeeper
Oozie
FUTURE OF HADOOP: YARN & NEAR REAL TIME SQL-ON-HADOOP
Multi Tenancy
HA (High Availability)
Tools for SQL-on-Hadoop:
Impala
Stinger/Tez
Drill
Shark
WHAT DO YOU CHOOSE?
The choices are huge
The toolsets are varied
First focus on the problems you are trying to solve. Don’t choose Hadoop because it is the latest buzzword. Make sure there is a real problem to solve.
Focus on developers and administrators, and ensure that for whatever toolset you choose they have the relevant skillset, training will be provided, or relevant resources will be brought in from outside (whether through hiring or consulting)
REMEMBER THE PROBLEM SET!!! i.e. what you are trying to solve
CAVEATS
Work is still being done on bringing real-time SQL-on-Hadoop to YARN.
Impala has Llama for this.
Stinger for Hive Preview is currently available.
HBase on YARN (HOYA) is also actively being worked on.
Since YARN is a low-level API, some abstraction is needed, which is available with tools such as Samza and Weave.
BIG DATA = BIG IMPACT
Ken Rudin, Director of Analytics, Facebook
“You need to go the last mile and evangelize your insights so that people actually act on them and there is impact.”
“It doesn’t matter how brilliant our analyses are. If nothing changes we have made no impact.”
GIVING BACK
Hadoop is an open source project
Work on it and on the ecosystem tools is done by committers and contributors, some of whom do this in their own personal time, reporting and fixing bugs as well as adding new functionality.
Please give back, either by becoming a contributor (testing, filing bugs) or by getting your use case for Hadoop out there (at meetups and/or conferences such as this one), so others can learn from the issues you have faced as well as see the rapid adoption of the Hadoop ecosystem toolset.
THANKS
Subash D’Souza
Twitter: @sawjd22
LinkedIn: www.linkedin.com/in/sawjd/
Email: [email protected]