Hadoop - Past, Present and Future - v1.1


Transcript


6/17/14
Prepared for:

Presented by: Big Data Joe Rossi (@bigdatajoerossi)
Hadoop: Past, Present and Future

Hadoop has become synonymous with Big Data. While Hadoop is a big part of the Big Data movement, Hadoop itself is just a platform and the tools that run on it.

Roadmap

~45 mins, then Q&A

1. What Makes Up Hadoop 1.x?
2. What's New In Hadoop 2.x?
3. The Future Of Hadoop


What Makes Up Hadoop 1.x?

Hadoop 1.0: HDFS + MapReduce. Diagram: a Client talks to the NameNode and the JobTracker; DataNode / TaskTracker nodes store the file blocks (1-1, 1-2, 1-3).


Hadoop 1.0: HDFS + MapReduce. Diagram: the JobTracker schedules Map and Reduce tasks into slots on the DataNode / TaskTracker nodes that hold the data blocks.

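To make the Map/Reduce flow in the diagram above concrete, here is a minimal WordCount job in Java. It is a generic sketch, not code from the talk; the programming model is the same whether it runs on MRv1 or on YARN, this version uses the newer org.apache.hadoop.mapreduce API, and the input/output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a JAR, this would typically be launched with something like `hadoop jar wordcount.jar WordCount /input /output`.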

MapReduce v1 Limitations
Scalability: maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000.
Availability: a JobTracker failure kills all queued and running jobs.
Resources partitioned into Map and Reduce: hard partitioning of Map and Reduce slots led to low resource utilization.
No support for alternate paradigms / services: only MapReduce batch jobs, nothing else.

The architecture of MapReduce came with its limitations.
Scalability: even as server specs rose to accommodate more load, it still couldn't scale past the maximum number of concurrent tasks.
Availability: a JobTracker failure kills all queued and running jobs. After a restart they have to be resubmitted and start from the beginning; they can't resume where they left off. That can be a huge problem if you have long-running batch jobs.
Resource partitioning: resources were broken up into distinct Map and Reduce slots, which aren't fungible. I love that word; it basically means they weren't interchangeable. Map slots might be full while Reduce slots remain empty, and vice versa. This needed to be addressed to ensure the entire system could be used at max capacity for high utilization.
Lack of support for other paradigms: you were stuck using MapReduce.

Apache Hadoop 1.0: Single Use System (batch apps). Diagram: Pig and Hive sit on top of MapReduce (cluster resource management and data processing), which sits on top of HDFS (redundant, reliable storage).

In Hadoop 1.0, all methods of accessing the data within the cluster were constrained to using MapReduce. Open-source Hadoop projects like Pig and Hive are built on top of MapReduce, and even though they make MapReduce more accessible, they still suffer from its limitations. You have seen some distributions move outside of the Hadoop ecosystem, like Cloudera's Impala, to get around the limitations of MapReduce and improve performance. But then, unfortunately, it isn't community supported and lags behind in features because it doesn't have the backing of the innovative open-source committers. The crazy thing is that even with these limitations, 90% of the use cases Nick spoke about yesterday are based on this.

What's New In Hadoop 2.x?

YARN Replaces MapReduce
YARN: Yet Another Resource Negotiator
YARN will be the de-facto distributed operating system for Big Data.

So, what has Trace3 found out about YARN on its journey through Big Data? Well, first of all, we discovered that it's not the type of yarn that cats play with. YARN will be the de-facto distributed operating system for Big Data, and by the end of this hour you are going to see why we believe that and why companies like Cloudera, Hortonworks and MapR are banking on it.

YARN: Taking Hadoop Beyond Batch
Store DATA in one place. Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service. Applications run natively IN Hadoop.
Diagram: YARN (cluster resource management) on top of HDFS2 (redundant, reliable storage), with the following workloads on top:

BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)

YARN is taking Hadoop beyond batch. YARN has solved the limitations of MapReduce v1. It gives you the ability to store all your data in one place, have mixed workloads working with that data, and still get predictable performance and QoS. YARN is moving Hadoop beyond just MapReduce and batch into interactive, online, streaming, graph, in-memory, and more.

YARN: Applications
Running all on the same Hadoop cluster to give applications access to all the same source data!

MapReduce v2, Stream Processing, Master-Worker, Online, In-Memory

Apache Storm

Here are some of the apps that are making up that compute time. HBase will be deployed on YARN, which we will talk about more a bit later. Master-Worker applications. MapReduce has been moved out into its own application framework. Real-time streaming analytics: this, in my opinion, is the most promising of the application types. I don't want to steal my associate Rikin's thunder, but he will be speaking in a lot more depth about real-time streaming analytics in a session later today. Graph processing: YARN has enabled the use of iterative applications like Apache Giraph within your cluster, where previously MapReduce v1 just wasn't a viable option.

YARN: Moving Quickly
Timeline, 2010 to today: conceived at Yahoo!; alpha releases (2.0); beta releases (2.1); GA released (2.2); version 2.3; version 2.4.
Today: 100,000+ nodes, 400,000+ jobs daily, 10 million+ hours of compute daily.

YARN is fairly new to the scene, but that shouldn't deter you from being confident in it. It was conceived and architected at Yahoo! and has gone through a very quick maturing process thanks to the open-source community putting it through its paces. Currently YARN is running on over 100,000 nodes, responsible for 400,000+ jobs and 10 million+ hours of compute time daily.

YARN: Dr. Evil Approved

Yes, I said 10 millllllion.

YARN: What Has Changed?

MRv1 vs. YARN
Legend: RM = ResourceManager, AM = ApplicationMaster, JT = JobTracker, NM = NodeManager, TT = TaskTracker.
MRv1 diagram: the JobTracker (with its Scheduler) drives TaskTrackers running MapReduce tasks.
YARN diagram: the ResourceManager (with its Scheduler) coordinates NodeManagers; an ApplicationMaster runs on one NodeManager, and the other NodeManagers host Containers running MapReduce tasks.

So, what has changed with YARN for it to be able to accomplish this? YARN splits the two major functions of the JobTracker into the ResourceManager and the ApplicationMaster.
The global ResourceManager handles all of the cluster resources. Its Scheduler performs scheduling based on the resource requirements of the applications.
The per-node slave, the NodeManager, is responsible for launching application containers, monitoring their resource usage, and reporting the same to the ResourceManager.
The per-application ApplicationMaster is responsible for negotiating the appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.
Per-application Containers run on the NodeManagers.
Let's see how these all work together.
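As a rough illustration of how a client talks to these components, the sketch below uses the YARN client API to ask the ResourceManager for a new application and to describe the container that should run an ApplicationMaster. This is not code from the talk; the application name, memory/vcore values, and the AM class in the launch command are hypothetical placeholders.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app"); // illustrative name

    // Describe the container that will run our ApplicationMaster
    // (com.example.DemoApplicationMaster is a hypothetical AM class)
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java com.example.DemoApplicationMaster"
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    ctx.setAMContainerSpec(amContainer);

    // Resources the Scheduler should reserve for the AM container
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(512);
    amResource.setVirtualCores(1);
    ctx.setResource(amResource);

    ApplicationId appId = ctx.getApplicationId();
    yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```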

6 Benefits of YARN
1. Scale
2. New programming models and services
3. Improved cluster utilization
4. Agility
5. Backwards compatible with MapReduce v1
6. Mixed workloads on the same source of data

Scale: YARN is no longer limited to the 40,000 concurrent tasks that MapReduce v1 had. Today YARN is already handling over 10 million hours of compute time on a daily basis.
New programming models and services: you aren't limited to just MapReduce. If your app can benefit from a distributed operating system, then you can utilize it.
Improved cluster utilization: YARN no longer has a hard partition of resources into Map and Reduce slots; it uses resource leases, aka Containers, that aren't limited in functionality.
Agility: by moving MapReduce out and on top of YARN, customers have more agility to make changes, upgrade, and run different versions of their framework without affecting the entire cluster.
Backwards compatible: what you are currently doing with Hadoop 1.x and MapReduce v1 will work with YARN.
Mixed workloads on the same data source: you can utilize the data-lake architecture and run all your apps while still having predictable performance and quality of service.

The Future of Hadoop: Projects and Roadmap


Stinger: Interactive Query for Hive
Speed: deliver interactive query through 100x performance increases compared to Hive 0.10.
SQL: support the broadest array of SQL semantics for analytic applications running against Hadoop.
Scale: the only SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.

One of the projects that I'm keeping a close eye on is the Stinger project.
Speed: a 100x speed increase over Hive 0.10.
SQL: improve HiveQL to make it more ANSI SQL-like.
Scale: the ability to run queries on terabytes to petabytes of information.
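As a small illustration of running SQL against Hadoop, the sketch below queries Hive through the HiveServer2 JDBC driver. The host, database, table, query, and credentials are made up for the example, and the Hive JDBC driver JAR is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 endpoint; host, port and database are illustrative
    String url = "jdbc:hive2://hiveserver.example.com:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement();
         // Hypothetical table "weblogs"; the query runs on the cluster
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs "
                 + "GROUP BY page ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```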

HOYA: HBase (NoSQL) on YARN
Dynamic scaling: on-demand cluster size; increase and decrease the size with load.
Easier deployment: APIs to create, start, stop and delete HBase clusters.
Availability: recover from Region Server loss with a new container.

Another project to watch closely is HBase on YARN.
Dynamic scaling: the cluster scales with usage as load increases.
Easier deployment: HBase cluster deployment can be somewhat complicated; they are looking to correct that by allowing you to do it through built-in APIs.
Availability: when a RegionServer is lost, recovering it is just a matter of deploying another container within the cluster.

Microsoft REEF
Machine learning: a framework well suited for building machine-learning jobs.
Scalable / fault tolerant: makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain state: users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.

REEF: Retainable Evaluator Execution Framework

This is a project that lies outside of my wheelhouse, but from what I've learned about it, it's going to do amazing things for machine learning. I've also highlighted this project to add even more credibility to YARN by showing you that a company like Microsoft is dedicating internal time and resources to building applications that run on YARN.

Heterogeneous Storages in HDFS. Diagram: previously the NameNode saw a single storage classification; now it can distinguish media types such as SATA, SSD and Fusion-io.

Previously a NameNode had one classification of storage media available to it.

Now, as of 2.3, NameNodes have the ability to distinguish the different storage media available to them.

Adding awareness of storage media can allow HDFS to make better decisions about the placement of block data with input from applications.

An application can choose the distribution of replicas based on its performance and durability requirements.
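A brief sketch of how an application might express those placement preferences: later HDFS releases expose per-path storage policies (for example ONE_SSD or COLD). The paths below are illustrative, and the setStoragePolicy call is assumed to be available in the Hadoop version in use; it arrived after the 2.3 work described here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // assumes fs.defaultFS points at HDFS

    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Keep one replica of "hot" data on SSD, the rest on spinning disk
      dfs.setStoragePolicy(new Path("/data/hot"), "ONE_SSD");
      // Push "cold" data onto archival storage
      dfs.setStoragePolicy(new Path("/data/cold"), "COLD");
    }
  }
}
```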

Hadoop Roadmap
Apache Hadoop 2.4 (released, early Q2 2014): ResourceManager HA / automatic failover; HDFS rolling upgrades.
Apache Hadoop 2.5 (mid Q2 2014): NodeManager restart without disruption; dynamic resource configuration.

NodeManager restart allows the NM to be restarted without losing jobs; they will continue where they left off after the restart. Dynamic resource configuration: currently containers are static, allocating a fixed amount of processor and memory to each process. Now processes will have the ability to scale up within a container if resources are available on that NodeManager.

I Know You Have Questions
No such thing as a stupid question.
Hadoop: Past, Present and Future


One Last Thing: SD Big Data Meetup
meetup.com/sdbigdata
2nd Wednesday of the month. Next: July 9th @ 5:45P


Thank You!
Hadoop: Past, Present and Future
Big Data Joe Rossi | http://bigdatajoe.io/ | @bigdatajoerossi


Supporting Slides: slides with information that may be asked about.


YARN: How It Works. Diagram: a Client submits an application to the ResourceManager (with its Scheduler); NodeManagers across the cluster host the ApplicationMaster and the Containers allocated to it.

Jobs are submitted to the ResourceManager via a public submission protocol and go through an admission-control phase during which security credentials are validated and various checks are performed.
The RM runs as a daemon on a dedicated machine and acts as the central authority arbitrating resources among the various competing applications in the cluster. Because it has a central and global view of the cluster resources, it can enforce properties such as fairness, capacity, and locality across nodes.
Accepted jobs are passed to the Scheduler to be run. Once the Scheduler has enough resources, the application is moved from the accepted to the running state. This involves allocating a resource lease, aka a container (a bound JVM), for the AM and spawning it on a node in the cluster. A record of accepted applications is written to persistent storage and recovered in case of RM failure.
The ApplicationMaster is the head of a job, managing all lifecycle aspects, including dynamically increasing and decreasing resource consumption, managing the flow of execution, and handling faults. By delegating all these functions to AMs, YARN's architecture gains a great deal of scalability, programming-model flexibility, and improved upgrading/testing, since multiple versions of the same framework can coexist.
The RM interacts with a special system daemon running on each node called the NodeManager (NM). Communications between the RM and NMs are heartbeat-based for scalability. NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing). The RM assembles its global view from these snapshots of NM state.
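To ground the ApplicationMaster's role described above, here is a condensed sketch of an AM using the AMRMClient API: it registers with the ResourceManager, requests a container from the Scheduler, polls via the heartbeat-style allocate call, and unregisters when done. It is a simplified illustration, not code from the talk; the resource sizes are arbitrary, and a real AM would also launch and monitor the allocated containers through the NodeManagers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class DemoApplicationMaster {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();

    // Register this AM with the ResourceManager
    rmClient.registerApplicationMaster("", 0, "");

    // Ask the Scheduler for one 1 GB / 1 vcore container (sizes are arbitrary)
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);
    capability.setVirtualCores(1);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Heartbeat the RM until the container is allocated
    while (true) {
      AllocateResponse response = rmClient.allocate(0.1f);
      if (!response.getAllocatedContainers().isEmpty()) {
        Container container = response.getAllocatedContainers().get(0);
        System.out.println("Got container " + container.getId()
            + " on node " + container.getNodeId());
        break;
      }
      Thread.sleep(1000);
    }

    // Tell the RM the application finished successfully
    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
  }
}
```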


YARN: Example App Deployment. Diagram: the HOYA Client submits to the ResourceManager (Scheduler); one NodeManager hosts the HOYA / HBase Master, and the other NodeManagers host Region Server containers.



Storm vs. DataTorrent: a solution matrix comparing DataTorrent and Apache Storm on criteria such as atomic micro-batching, events per second (billions vs. thousands), and automated par…
