Hadoop - Past, Present and Future - v1.1

04/07/2023

Prepared for:

Presented by: “Big Data Joe” Rossi (@bigdatajoerossi)

Hadoop: Past, Present and Future

Roadmap

~45 mins

1- What Makes Up Hadoop 1.x?

2- What’s New In Hadoop 2.x?

3- The Future Of Hadoop …

What Makes Up Hadoop 1.x?

Hadoop 1.0: HDFS + MapReduce

[Diagram: a Client, the NameNode, the JobTracker, and two DataNode / TaskTracker nodes; file blocks are labeled 1-1, 1-2, 1-3]

Hadoop 1.0: HDFS + MapReduce

[Diagram: the same cluster running a job; a Client, the NameNode, the JobTracker, and two DataNode / TaskTracker nodes, each with Map and Reduce tasks over blocks labeled 1-1 through 4-3]
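The Map and Reduce phases in the diagram can be sketched in a few lines. This is a minimal, single-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count job itself are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    """Run map over every block, then shuffle and reduce the results."""
    pairs = []
    for block in blocks:  # on a real cluster, each block is mapped on its own node
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["hadoop stores blocks", "hadoop runs map and reduce"]))
```

In Hadoop 1.x the JobTracker would hand each map call to a TaskTracker that holds the block locally; here everything runs in one process for clarity.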

MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000

Availability: JobTracker failure kills all queued and running jobs

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
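The utilization point above can be made concrete with a little arithmetic. The sketch below is a simplified model, assuming every task needs exactly one equally sized slot; the function names and the example numbers are hypothetical and not taken from the talk.

```python
def partitioned_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """MRv1-style: map tasks may only use map slots, reduce tasks only reduce slots."""
    used = min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)
    return used / (map_slots + reduce_slots)


def pooled_utilization(total_slots, map_tasks, reduce_tasks):
    """Pooled model (a simplification of containers): any task may use any free slot."""
    used = min(map_tasks + reduce_tasks, total_slots)
    return used / total_slots


if __name__ == "__main__":
    # A map-heavy moment: 12 map tasks waiting, no reduce tasks yet.
    print(partitioned_utilization(map_slots=8, reduce_slots=4, map_tasks=12, reduce_tasks=0))
    print(pooled_utilization(total_slots=12, map_tasks=12, reduce_tasks=0))
```

With hard partitioning the four reduce slots sit idle while map tasks queue; with a shared pool the same hardware is fully used.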

HADOOP 1.0

Single Use System: Batch Apps

Apache Hadoop 1.0: Single Use System

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig, Hive

What’s New In Hadoop 2.x?

YARN Replaces MapReduce as the Cluster Resource Manager

Yet Another Resource Negotiator

YARN will be the de-facto distributed operating system for Big Data

Store DATA in one place

YARN: Taking Hadoop Beyond Batch

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

BATCH (MapReduce)

INTERACTIVE (Tez)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Applications

MapReduce v2

Stream Processing

Master-Worker

Online

In-Memory

Apache Storm

YARN: Moving Quickly

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

100,000+ nodes, 400,000+ jobs daily

10 million+ hours of compute daily

Version 2.3

Version 2.4

YARN: Dr. Evil Approved

YARN: What Has Changed?

YARN                                             MRv1
RM (ResourceManager) + AM (ApplicationMaster)    JT (JobTracker)
Scheduler                                        Scheduler
NM (NodeManager)                                 TT (TaskTracker)
Container                                        Map / Reduce

[Diagram: YARN (a ResourceManager with its Scheduler; NodeManagers hosting an ApplicationMaster and Containers) shown beside MRv1 (a JobTracker with its Scheduler; TaskTrackers hosting Map and Reduce tasks)]

6 Benefits of YARN

Scale

New programming models and services

Improved cluster utilization

Agility

Backwards compatible with MapReduce v1

Mixed workloads on the same source of data

The Future of Hadoop: Projects and Roadmap

Stinger: Interactive Query for Hive

Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.

SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.

Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.

HOYA: HBase (NoSQL) on YARN

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.

Microsoft REEF

Retainable Evaluator Execution Framework

Machine Learning: Framework well suited for building machine learning jobs.

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Heterogeneous Storages in HDFS

[Diagram: a NameNode tracking a single generic Storage type, compared with a NameNode aware of distinct storage types: SATA, SSD, Fusion IO]

Hadoop Roadmap

Apache Hadoop 2.4: ResourceManager HA / Auto Failover; HDFS Rolling Upgrades

Apache Hadoop 2.5: NodeManager Restart w/o disruption; Dynamic Resource Configuration

Timeline labels: RELEASED, EARLY Q2 2014, MID Q2 2014

I Know You Have Questions …

No such thing as a stupid question.

Hadoop: Past, Present and Future

SD Big Data Meetup

One Last Thing …

meetup.com/sdbigdata

2nd Wednesday Of The Month

Next: July 9th @ 5:45 PM

Thank You!

Hadoop: Past, Present and Future

Big Data Joe Rossi

http://bigdatajoe.io/

@bigdatajoerossi

Supporting Slides: Slides with information that may be asked about

YARN: How It Works

[Diagram: a Client submits to the ResourceManager (with its Scheduler); four NodeManagers host one ApplicationMaster and three Containers]
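The flow in this diagram can be sketched as a toy model: a client submits an application, the ResourceManager's scheduler places an ApplicationMaster container, and the ApplicationMaster then obtains further containers for its tasks. None of the class or method names below are YARN's real API; they are hypothetical, and the "most free capacity" placement rule is an assumption made to keep the example short.

```python
class NodeManager:
    """One worker node with a fixed number of container slots (a simplification)."""

    def __init__(self, name, capacity):
        self.name = name
        self.free = capacity
        self.containers = []

    def launch(self, label):
        self.free -= 1
        self.containers.append(label)


class ResourceManager:
    """Tracks the nodes and acts as the scheduler for container requests."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label):
        # Assumed policy: pick the node with the most free slots (first one wins ties).
        node = max(self.nodes, key=lambda n: n.free)
        if node.free == 0:
            raise RuntimeError("cluster is full")
        node.launch(label)
        return node.name

    def submit(self, app, num_tasks):
        # Step 1: the client's submission starts an ApplicationMaster container.
        placements = [self.allocate(f"{app}-AM")]
        # Step 2: the ApplicationMaster negotiates one container per task.
        placements += [self.allocate(f"{app}-task{i}") for i in range(num_tasks)]
        return placements


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
    print(rm.submit("job", 2))
```

The point of the split is that the ResourceManager only hands out containers, while per-application logic lives in the ApplicationMaster, so a framework other than MapReduce can supply its own.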

YARN: Example App Deployment

[Diagram: the HOYA Client submits to the ResourceManager (with its Scheduler); four NodeManagers host the HOYA / HBase Master and three Region Servers]

Storm Vs. DataTorrent

Solution Matrix            DataTorrent    Apache Storm
Atomic Micro-batch         1              3
Events per Second          Billions       Thousands
Automated Parallelism                     3
Dynamic Runtime Changes                   3
Linear Scalability                        3
State Checkpointing                       3

Apache Spark + Shark

Apache Spark

Hive (SQL)

Hadoop 2.x – YARN + HDFS

[Diagram: a NameNode, a Standby NameNode / ResourceManager, and two DataNode / NodeManager nodes running Containers]

YARN: Key Take-Aways

Backwards Compatible: YARN is Backwards Compatible for your existing MapReduce applications. You can get value from it right away.

Resource Management: YARN enables Fine Grained Resource Management for better cluster utilization.

One Source of Data: YARN allows you to interact with One Source of Data in multiple ways while maintaining Predictable Performance and Quality of Service.

Enabling Smart People: YARN is a flexible framework that is enabling smart people and companies to do amazing things with data.

YARN will be the de-facto distributed operating system for Big Data

Storm Vs. DataTorrent - Detailed

Solution Matrix                      DataTorrent          Apache Storm
Proprietary / Open Source            O                    O
Support for Hadoop 1.x               1                    1
Support for Hadoop 2.x               1                    1
Native YARN                          1                    3
Dashboard                            1                    3
Extensible via Modules               1                    1
Technical Support                    1                    1
Atomic Micro-batch                   1                    3
Events per Second                    Billions             Thousands
Automated Parallelism                1                    3
Dynamic Runtime Changes              1                    3
High Availability                    1                    2
Prog. Languages Supported            Java, Python, etc.   Java, Python, etc.
Log Analysis                         1                    3
Site Operations                      1                    3
MapReduce Diagnostics                1                    3
Open Source Operators Library        1                    2
Open Source Application Templates    1                    3
Complex Computations (DAG)           1                    3
Linear Scalability                   1                    3
Security                             1                    3
CLI and Macros                       1                    3
Configuration Based Specification    1                    3
State Checkpointing                  1                    3

The 1st Generation Of Hadoop

Users forced to create data system silos for managing mixed workloads

Developers forced to abuse the very specific MapReduce model to fit their use cases

Hadoop

Apache Spark

Hive (SQL)

Spark Streaming

MLlib (machine learning)

[Chart: Project Management Committee Members by company (Hortonworks, Others, Cloudera, Yahoo!, Facebook); axis from 0 to 16]

[Chart: Project Committers by company (Hortonworks, Others, Cloudera, Yahoo!, Facebook); axis from 0 to 30]

YARN: Why The De-Facto Distributed OS

Technology Adoption: 100,000+ nodes, 400,000+ jobs, and 10 million+ compute hours daily

Enables Innovation: Enables smart people and companies to do amazing things with data

Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone

Apache Storm Topology

Spout (Data Source)

Stream

Bolt (Filter)

Bolt (Calculation)

Bolt (RDBMS Writes)

Bolt (HDFS Writes)
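A topology like the one above can be approximated with plain Python generators. This is only a sketch of the spout-and-bolt idea and does not use Storm's API; the wiring (spout, then filter, then calculation, then both writers), the event fields `id` and `value`, and the in-memory lists standing in for the RDBMS and HDFS are all assumptions made for the example.

```python
def spout(events):
    """Spout: the data source that emits tuples into the stream."""
    for event in events:
        yield event


def filter_bolt(stream):
    """Bolt (Filter): drop tuples with a negative value."""
    for event in stream:
        if event["value"] >= 0:
            yield event


def calculation_bolt(stream):
    """Bolt (Calculation): attach a running total to each tuple."""
    total = 0
    for event in stream:
        total += event["value"]
        yield {"id": event["id"], "running_total": total}


def run_topology(events):
    """Wire the spout and bolts together; two sink bolts receive every result."""
    rdbms_rows, hdfs_lines = [], []  # stand-ins for the real sinks
    for result in calculation_bolt(filter_bolt(spout(events))):
        rdbms_rows.append((result["id"], result["running_total"]))   # Bolt (RDBMS Writes)
        hdfs_lines.append(f'{result["id"]},{result["running_total"]}')  # Bolt (HDFS Writes)
    return rdbms_rows, hdfs_lines


if __name__ == "__main__":
    sample = [{"id": 1, "value": 5}, {"id": 2, "value": -1}, {"id": 3, "value": 2}]
    print(run_topology(sample))
```

Each tuple flows through the whole chain as soon as it is emitted, which is the key contrast with a batch MapReduce job that waits for all input before reducing.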

HDFS Write Data Flow

[Diagram: the Client consults the NameNode, then streams Block Bytes through a pipeline of 3 DataNodes; Acks flow back along the pipeline and the block write is reported complete]
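The write path in this diagram can be sketched as a small simulation: the client asks the NameNode for a pipeline, the block bytes are forwarded DataNode to DataNode, and acks travel back in reverse order. The class and method names are hypothetical rather than HDFS's real interfaces, placement is simplified to "the first three DataNodes" (real HDFS placement is rack-aware), and having the client report completion is a simplification.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, and return the acks."""
        self.blocks[block_id] = data
        acks = downstream[0].write(block_id, data, downstream[1:]) if downstream else []
        return acks + [self.name]  # this node acks only after downstream nodes have


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}  # block id -> names of DataNodes holding a replica

    def choose_pipeline(self, block_id):
        # Assumed placement: simply the first `replication` DataNodes.
        return self.datanodes[: self.replication]

    def block_complete(self, block_id, holders):
        self.block_map[block_id] = holders


def write_block(namenode, block_id, data):
    """Client side: get a pipeline, stream the block, wait for all acks."""
    pipeline = namenode.choose_pipeline(block_id)
    acks = pipeline[0].write(block_id, data, pipeline[1:])
    if len(acks) != len(pipeline):
        raise IOError("missing ack from a DataNode")
    namenode.block_complete(block_id, [dn.name for dn in pipeline])
    return acks


if __name__ == "__main__":
    nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    print(write_block(NameNode(nodes), "blk_1", b"hello"))
```

Note that the NameNode never touches the block bytes; it only hands out the pipeline and records where the replicas ended up.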
