Transcript
Page 1: Hadoop - Past, Present and Future - v1.1

05/02/2023

Prepared for:

Presented by: “Big Data Joe” Rossi, @bigdatajoerossi

Hadoop: Past, Present and Future

Page 2: Hadoop - Past, Present and Future - v1.1

Roadmap

~45mins

Q&A

1- What Makes Up Hadoop 1.x?

2- What’s New In Hadoop 2.x?

3- The Future Of Hadoop …

Page 3: Hadoop - Past, Present and Future - v1.1

What Makes Up Hadoop 1.x?

Page 4: Hadoop - Past, Present and Future - v1.1

Hadoop 1.0: HDFS + MapReduce

NameNode

DataNode / TaskTracker DataNode / TaskTracker

DataNode / TaskTracker DataNode / TaskTracker

JobTracker

Client

1-1 1-2 1-3

Page 5: Hadoop - Past, Present and Future - v1.1

Hadoop 1.0: HDFS + MapReduce

NameNode

DataNode / TaskTracker DataNode / TaskTracker

DataNode / TaskTracker DataNode / TaskTracker

JobTracker

Client

1-1 1-2 1-3

Reduce Map

2-1 3-2 3-3 4-1

2-3 4-2 2-2 3-1 4-3

Reduce Map
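The Map and Reduce stages in the diagram above can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are assumptions chosen for the example, and everything runs in one process rather than on TaskTrackers.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(blocks):
    # In Hadoop 1.x each block would be mapped on the node that stores it;
    # here the blocks are simply processed one after another.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


# Two hypothetical input blocks.
blocks = ["hadoop stores data", "hadoop processes data"]
counts = run_job(blocks)
```

Here `counts` maps each word to the number of times it appears across both blocks.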

Page 6: Hadoop - Past, Present and Future - v1.1

MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000.

Availability: JobTracker failure kills all queued and running jobs.

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization.

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
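The utilization problem caused by hard slot partitioning can be made concrete with a little arithmetic. The sketch below is a simplified model, not Hadoop code: the node size (6 map slots, 4 reduce slots) and the pending task counts are hypothetical numbers picked for illustration.

```python
def slots_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)


def pooled_utilization(total_slots, map_tasks, reduce_tasks):
    """Shared pool: any pending task may use any free unit of capacity."""
    running = min(total_slots, map_tasks + reduce_tasks)
    return running / total_slots


# Hypothetical node during the map-heavy phase of a job:
# 20 map tasks are waiting, no reduce task is ready yet.
fixed = slots_utilization(6, 4, map_tasks=20, reduce_tasks=0)
pooled = pooled_utilization(10, map_tasks=20, reduce_tasks=0)
```

In this model the four reduce slots sit idle while map tasks queue, so the partitioned node is only 60% busy, while the same capacity treated as one pool is fully used.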

Page 7: Hadoop - Past, Present and Future - v1.1

HADOOP 1.0

Single Use System

Batch Apps

Apache Hadoop 1.0: Single Use System

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig Hive

Page 8: Hadoop - Past, Present and Future - v1.1

What’s New In Hadoop 2.x?

Page 9: Hadoop - Past, Present and Future - v1.1

YARN Replaces MapReduce

Yet Another Resource Negotiator

YARN

YARN will be the de-facto distributed operating system for Big Data

Page 10: Hadoop - Past, Present and Future - v1.1

Store DATA in one place

YARN: Taking Hadoop Beyond Batch

Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service

Applications Run Natively IN Hadoop

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

BATCH (MapReduce)

INTERACTIVE (Tez)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

Page 11: Hadoop - Past, Present and Future - v1.1

Running all on the same Hadoop cluster to give applications access to all the same source data!

YARN: Applications

MapReduce v2

Stream Processing

Master-Worker

Online

In-Memory

Apache Storm

Page 12: Hadoop - Past, Present and Future - v1.1

2010

2011

2012

2013

2014

Today

YARN: Moving Quickly

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

100,000+ nodes, 400,000+ jobs daily

10 million+ hours of compute daily

Version 2.3

Version 2.4

Page 13: Hadoop - Past, Present and Future - v1.1

YARN: Dr. Evil Approved

Page 14: Hadoop - Past, Present and Future - v1.1

YARN: What Has Changed?

YARN                                                  MRv1
RM (ResourceManager) + AM (ApplicationMaster)         JT (JobTracker)
Scheduler                                             Scheduler
NM (NodeManager)                                      TT (TaskTracker)
Container                                             Map, Reduce

ResourceManager

Scheduler

JobTracker

Scheduler

NodeManager

ApplicationMaster

TaskTracker

Map Reduce

NodeManager

Container Container

TaskTracker

Map Reduce

Page 15: Hadoop - Past, Present and Future - v1.1

6 Benefits of YARN

1- Scale

2- New programming models and services

3- Improved cluster utilization

4- Agility

5- Backwards compatible with MapReduce v1

6- Mixed workloads on the same source of data

Page 16: Hadoop - Past, Present and Future - v1.1

The Future of Hadoop

Projects and Roadmap

Page 17: Hadoop - Past, Present and Future - v1.1

Stinger: Interactive Query for Hive

Speed: Deliver interactive query through 100x performance increases as compared to Hive 0.10.

SQL: Support the broadest array of SQL semantics for analytic applications running against Hadoop.

Scale: The only SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.

Page 18: Hadoop - Past, Present and Future - v1.1

HOYA: HBase (NoSQL) on YARN

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.

Page 19: Hadoop - Past, Present and Future - v1.1

Microsoft REEF (Retainable Evaluator Execution Framework)

Machine Learning: Framework well suited for building machine learning jobs.

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.

Page 20: Hadoop - Past, Present and Future - v1.1

Heterogeneous Storages in HDFS

NameNode

Storage

NameNode

SATA SSD Fusion IO

Page 21: Hadoop - Past, Present and Future - v1.1

Hadoop Roadmap

Apache Hadoop 2.4 (released early Q2 2014): ResourceManager HA / Auto Failover, HDFS Rolling Upgrades

Apache Hadoop 2.5 (mid Q2 2014): NodeManager Restart w/o disruption, Dynamic Resource Configuration

Page 22: Hadoop - Past, Present and Future - v1.1

I Know You Have Questions …

No such thing as a stupid question.

Hadoop: Past, Present and Future

Page 23: Hadoop - Past, Present and Future - v1.1

SD Big Data Meetup

One Last Thing …

meetup.com/sdbigdata

2nd Wednesday Of The Month

Next: July 9th @ 5:45 PM

Page 24: Hadoop - Past, Present and Future - v1.1

Thank You!

Hadoop: Past, Present and Future

Big Data Joe Rossi

http://bigdatajoe.io/

@bigdatajoerossi

Page 25: Hadoop - Past, Present and Future - v1.1

Supporting Slides

Slides with information that may be asked about

Page 26: Hadoop - Past, Present and Future - v1.1

YARN: How It Works

ResourceManager

NodeManager

ApplicationMaster

NodeManager

NodeManager NodeManager

Scheduler

Container

Container Container

Client
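The interaction between the components listed above can be sketched as a toy simulation. This is a heavily simplified model and not YARN's API: the class and method names, the memory sizes, and the "most free memory first" placement rule are all assumptions made for illustration. Real YARN uses pluggable schedulers and allocates containers in response to NodeManager heartbeats.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, label, mb):
        """Start a container on this node and reserve its memory."""
        self.free_mb -= mb
        self.containers.append(label)


class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label, mb):
        """Scheduler (toy rule): place the container on the node with most free memory."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mb:
            return None  # no capacity; a real scheduler would queue the request
        node.launch(label, mb)
        return node.name

    def submit(self, app, am_mb, task_mb, tasks):
        # Step 1: the client submits an application and the ResourceManager
        # starts an ApplicationMaster for it in a container.
        placements = [self.allocate(f"{app}-AM", am_mb)]
        # Step 2: the ApplicationMaster requests containers for its tasks.
        for i in range(tasks):
            placements.append(self.allocate(f"{app}-task{i}", task_mb))
        return placements


# Four hypothetical NodeManagers with 4 GB each, matching the four in the diagram.
nodes = [NodeManager(f"nm{i}", free_mb=4096) for i in range(1, 5)]
rm = ResourceManager(nodes)
placements = rm.submit("app1", am_mb=1024, task_mb=2048, tasks=3)
```

With these numbers the ApplicationMaster lands on the first node and the three task containers are spread over the remaining three, which mirrors the one ApplicationMaster and three Containers shown on the slide.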

Page 27: Hadoop - Past, Present and Future - v1.1

YARN: Example App Deployment

ResourceManager

NodeManager

HOYA / HBase Master

NodeManager

NodeManager NodeManager

Scheduler

Region Server

Region Server Region Server

HOYA Client

Page 28: Hadoop - Past, Present and Future - v1.1

Storm Vs. DataTorrent

Solution Matrix: DataTorrent, Apache Storm

Atomic Micro-batch 1 3

Events per Second Billions Thousands

Automated Parallelism 3

Dynamic Runtime Changes 3

Linear Scalability 3

State Checkpointing 3

Page 29: Hadoop - Past, Present and Future - v1.1

Apache Spark + Shark

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Apache Spark

Shark

Hive (sql)

Page 30: Hadoop - Past, Present and Future - v1.1

Hadoop 2.x – YARN + HDFS

NameNode

DataNode / NodeManager DataNode / NodeManager

DataNode / NodeManager DataNode / NodeManager

Standby NameNode / ResourceManager

Container Container

Container Container

Container Container

Container Container

Page 31: Hadoop - Past, Present and Future - v1.1

YARN: Key Take-Aways

Backwards Compatible: YARN is backwards compatible with your existing MapReduce applications, so you can get value from it right away.

Resource Management: YARN enables fine-grained resource management for better cluster utilization.

One Source of Data: YARN allows you to interact with one source of data in multiple ways while maintaining predictable performance and quality of service.

Enabling Smart People: YARN is a flexible framework that enables smart people and companies to do amazing things with data.

YARN will be the de-facto distributed operating system for Big Data

Page 32: Hadoop - Past, Present and Future - v1.1

Storm Vs. DataTorrent - Detailed

Solution Matrix                      DataTorrent           Apache Storm
Proprietary / Open Source            O                     O
Support for Hadoop 1.x               1                     1
Support for Hadoop 2.x               1                     1
Native YARN                          1                     3
Dashboard                            1                     3
Extensible via Modules               1                     1
Technical Support                    1                     1
Atomic Micro-batch                   1                     3
Events per Second                    Billions              Thousands
Automated Parallelism                1                     3
Dynamic Runtime Changes              1                     3
High Availability                    1                     2
Prog. Languages Supported            Java, Python, etc.    Java, Python, etc.
Log Analysis                         1                     3
Site Operations                      1                     3
MapReduce Diagnostics                1                     3
Open Source Operators Library        1                     2
Open Source Application Templates    1                     3
Complex Computations (DAG)           1                     3
Linear Scalability                   1                     3
Security                             1                     3
CLI and Macros                       1                     3
Configuration Based Specification    1                     3
State Checkpointing                  1                     3

Page 33: Hadoop - Past, Present and Future - v1.1

The 1st Generation Of Hadoop

Users forced to create data system silos for managing mixed workloads

Developers forced to abuse very specific MapReduce to fit their use cases

Hadoop

HBase

Page 34: Hadoop - Past, Present and Future - v1.1

Apache Spark

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Apache Spark

Shark

Hive (sql)

Spark Streaming

MLlib (machine learning)

Page 35: Hadoop - Past, Present and Future - v1.1

Project Mgt Committee Members

[Bar chart: number of members by company. Categories: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Bar values: 7, 6, 3, 15, 11.]

Page 36: Hadoop - Past, Present and Future - v1.1

Project Committers

[Bar chart: number of committers by company. Categories: Hortonworks, Others, Cloudera, Yahoo!, Facebook. Bar values: 24, 24, 11, 11, 5.]

Page 37: Hadoop - Past, Present and Future - v1.1

YARN: Why The De-Facto Distributed OS

Technology Adoption: 100,000+ nodes, 400,000 jobs and 10m compute hours daily

Enables Innovation: Enables smart people and companies to do amazing things with data

Financial Backing: 568m+ invested in Hadoop contributing companies, nearly 400m in 2013 alone

Page 38: Hadoop - Past, Present and Future - v1.1

Apache Storm Topology

Spout

Bolt (Filter)

Stream (Data Source)

Spout

Stream (Data Source)

Bolt (RDBMS Writes)

Bolt (Calculation)

Bolt (HDFS Writes)

RDBMS

HDFS
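The spout-and-bolt structure above can be sketched with plain Python generators. This is not Storm's API, and the wiring is an assumption, since the arrows of the original diagram are not recoverable from the transcript: here one spout feeds a filter bolt, then a calculation bolt, whose output goes to two sink bolts. The lists `rdbms_rows` and `hdfs_records` are hypothetical stand-ins for the RDBMS and HDFS.

```python
def spout(source):
    """Spout: turn a data source into a stream of tuples."""
    for record in source:
        yield record


def filter_bolt(stream):
    """Bolt (Filter): drop tuples that fail a hypothetical validity check."""
    for value in stream:
        if value >= 0:
            yield value


def calculation_bolt(stream):
    """Bolt (Calculation): emit each value together with a running total."""
    total = 0
    for value in stream:
        total += value
        yield (value, total)


def run_topology(source, rdbms, hdfs):
    """Wire the spout and bolts together and drain the stream into both sinks."""
    stream = calculation_bolt(filter_bolt(spout(source)))
    for value, total in stream:
        rdbms.append(total)  # stands in for Bolt (RDBMS Writes)
        hdfs.append(value)   # stands in for Bolt (HDFS Writes)


rdbms_rows, hdfs_records = [], []
run_topology([3, -1, 4, 5], rdbms_rows, hdfs_records)
```

The negative value is dropped by the filter, so the RDBMS stand-in receives the running totals and the HDFS stand-in receives the surviving raw values. A real topology would run each spout and bolt as parallel tasks across the cluster rather than as one chained loop.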

Page 39: Hadoop - Past, Present and Future - v1.1

HDFS Write Data Flow

[Diagram: Client, NameNode, and DataNodes A, B and C, with numbered steps 1 to 7. Block Bytes are passed to each of the three DataNodes, each DataNode returns an Ack, and the flow ends with Block Write Complete.]
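The write flow on this slide can be approximated with a small simulation. It is a sketch under assumptions, not HDFS code: the class names and methods are invented for the example, the pipeline is simply the first three DataNodes (real HDFS placement is rack-aware), and the exact numbering of the seven steps on the original slide is not reproduced.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, then return the acks."""
        self.blocks[block_id] = data
        if downstream:
            acks = downstream[0].write(block_id, data, downstream[1:])
        else:
            acks = []
        # This node acknowledges only after every node behind it has done so.
        return acks + [self.name]


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.completed = {}

    def choose_pipeline(self):
        # Simplification: take the first N DataNodes.
        return self.datanodes[: self.replication]

    def block_complete(self, block_id, holders):
        """Record which DataNodes hold the finished block."""
        self.completed[block_id] = holders


def client_write(namenode, block_id, data):
    pipeline = namenode.choose_pipeline()                    # ask NameNode for targets
    acks = pipeline[0].write(block_id, data, pipeline[1:])   # send block bytes to A
    namenode.block_complete(block_id, [dn.name for dn in pipeline])
    return acks


datanodes = [DataNode(n) for n in ("A", "B", "C")]
namenode = NameNode(datanodes)
acks = client_write(namenode, "blk_1", b"block bytes")
```

The acks come back in reverse pipeline order (C, then B, then A), which shows why the client only sees success once every replica has been written.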

