
YARN: Future of Data Processing with Apache Hadoop


Vinod Kumar Vavilapalli, Lead Developer @ Hortonworks (@Tshooter, @hortonworks)


Hello! I’m Vinod


•  Lead Software Developer @ Hortonworks
  –  MapReduce and YARN
  –  Hadoop-1.0, Hadoop security, CapacityScheduler and HadoopOnDemand before that
  –  Previously at Yahoo!, stabilizing Hadoop and helping take it to the scale of tens of thousands of nodes, the fruits of which the rest of the world is now enjoying
•  Apache Hadoop, ASF
  –  Committer and PMC member of Apache Hadoop
  –  Lurking in the MapReduce and YARN mailing lists and bug trackers
  –  Hadoop YARN project lead developer


Agenda


• Overview

• Status Quo

• Architecture

• Improvements and Updates

• Q&A


Hadoop MapReduce Classic

•  JobTracker
  –  Manages cluster resources and job scheduling
•  TaskTracker
  –  Per-node agent
  –  Manages tasks


Current Limitations

•  Scalability
  –  Maximum cluster size: 4,000 nodes
  –  Maximum concurrent tasks: 40,000
  –  Coarse synchronization in the JobTracker
•  Single point of failure
  –  A failure kills all queued and running jobs
  –  Jobs need to be re-submitted by users
  –  Restart is very tricky due to complex state


•  Hard partition of resources into map and reduce slots
  –  Low resource utilization
•  Lacks support for alternate paradigms
  –  Iterative applications implemented using MapReduce are 10x slower
  –  Hacks for the likes of MPI and graph processing
•  Lack of wire-compatible protocols
  –  Client and cluster must be the same version
  –  Applications and workflows cannot migrate to different clusters


Requirements

•  Reliability
•  Availability
•  Utilization
•  Wire compatibility
•  Agility & evolution
  –  Ability for customers to control upgrades to the grid software stack
•  Scalability
  –  Clusters of 6,000-10,000 machines
  –  Each machine with 16 cores, 48 GB/96 GB RAM, 24 TB/36 TB of disk
  –  100,000+ concurrent tasks
  –  10,000 concurrent jobs


Design Centre

•  Split up the two major functions of the JobTracker
  –  Cluster resource management
  –  Application life-cycle management
•  MapReduce becomes a user-land library


Architecture

•  Application
  –  An application is a job submitted to the framework
  –  Example: a MapReduce job
•  Container
  –  Basic unit of allocation
  –  Example: container A = 2 GB memory, 1 CPU (see the sketch below)
  –  Replaces the fixed map/reduce slots
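To make the container idea concrete: a minimal sketch, assuming the YARN client library of later Hadoop 2.x releases (API details may differ in the 2.0.2-alpha line discussed here), of describing the 2 GB / 1 CPU container above as a resource request. The class name and priority value are illustrative.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerSpecSketch {
  public static void main(String[] args) {
    // A container is just a bundle of resources: 2 GB of memory
    // (expressed in MB) and 1 virtual core, instead of a fixed
    // map or reduce slot.
    Resource capability = Resource.newInstance(2048, 1);

    // Priority orders competing requests from the same application.
    Priority priority = Priority.newInstance(0);

    // No node or rack constraints here: any machine in the cluster will do.
    ContainerRequest request =
        new ContainerRequest(capability, null, null, priority);

    System.out.println("Asking for: " + request.getCapability());
  }
}
```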


•  ResourceManager
  –  Global resource scheduler
  –  Hierarchical queues
•  NodeManager
  –  Per-machine agent
  –  Manages the life-cycle of containers
  –  Container resource monitoring
•  ApplicationMaster
  –  Per-application
  –  Manages application scheduling and task execution
  –  E.g. the MapReduce ApplicationMaster


[Architecture diagram: a Client submits a job to the ResourceManager; the ResourceManager starts a per-application App Mstr (ApplicationMaster) in a container on a NodeManager; the ApplicationMaster sends Resource Requests to the ResourceManager and runs its Containers on NodeManagers; NodeManagers report Node Status to the ResourceManager, and the MapReduce ApplicationMaster reports MapReduce Status back to the Client. A minimal client-side submission sketch follows.]
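The "Job Submission" arrow is ordinary client code against the ResourceManager. Below is a hedged sketch using the YarnClient API as it appears in later Hadoop 2.x releases (the 2.0.x alphas expose slightly different client classes); the application name, queue and the ./my-app-master.sh launch command are placeholders, not anything shipped with YARN.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToResourceManager {
  public static void main(String[] args) throws Exception {
    // Talk to the ResourceManager using the cluster configuration on the classpath.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id and fill in the submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("my-yarn-app");   // placeholder name
    ctx.setQueue("default");

    // The RM only launches the ApplicationMaster container; the command below
    // is a placeholder for whatever starts your AM.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.<String, LocalResource>emptyMap(),  // local resources (jars, scripts)
        Collections.<String, String>emptyMap(),         // environment
        Collections.singletonList("./my-app-master.sh"),
        null, null, null);
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024, 1));     // resources for the AM itself

    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted application " + appId);

    yarnClient.stop();
  }
}
```

Note that the ResourceManager only launches the ApplicationMaster container; every other container is negotiated afterwards by the AM itself.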


Improvements vis-à-vis classic MapReduce


•  Utilization
  –  Generic resource model
    •  Data locality (node, rack, etc.; a locality-aware request is sketched after this list)
    •  Memory
    •  CPU
    •  Disk bandwidth
    •  Network bandwidth
  –  Removes the fixed partition of map and reduce slots
•  Scalability
  –  Application life-cycle management is very expensive
  –  Partition resource management and application life-cycle management
  –  Application management is distributed
  –  Hardware trends: currently run clusters of 4,000 machines
    •  6,000 machines of 2012 > 12,000 machines of 2009
    •  <16+ cores, 48/96 GB RAM, 24 TB disk> vs. <8 cores, 16 GB RAM, 4 TB disk>
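Locality is part of the same request API rather than a side channel. A short sketch, with hypothetical host and rack names, of asking for the same 2 GB / 1 core container but with node and rack preferences (again using the later Hadoop 2.x AMRMClient API):

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityAwareRequest {
  public static void main(String[] args) {
    // Hypothetical hosts and rack where the input blocks live.
    String[] nodes = { "node17.example.com", "node42.example.com" };
    String[] racks = { "/default-rack" };

    // Same 2 GB / 1 core bundle as before, but with node and rack hints
    // so the scheduler can favour data-local containers.
    ContainerRequest request = new ContainerRequest(
        Resource.newInstance(2048, 1), nodes, racks, Priority.newInstance(0));

    System.out.println("Preferred racks: " + request.getRacks());
  }
}
```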


•  Fault tolerance and availability
  –  ResourceManager
    •  No single point of failure: state saved in ZooKeeper (coming soon)
    •  ApplicationMasters are restarted automatically on RM restart
  –  ApplicationMaster
    •  Optional failover via application-specific checkpoint
    •  MapReduce applications pick up where they left off via state saved in HDFS
•  Wire compatibility
  –  Protocols are wire-compatible
  –  Old clients can talk to new servers
  –  Rolling upgrades


•  Innovation and agility
  –  MapReduce now becomes a user-land library (see the sketch below)
  –  Multiple versions of MapReduce can run in the same cluster (a la Apache Pig)
    •  Faster deployment cycles for improvements
  –  Customers upgrade MapReduce versions on their own schedule
  –  Users can customize MapReduce
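As an illustration of "user-land" MapReduce: the framework a job runs on is selected by client-side configuration, and the MapReduce runtime is packaged with the job rather than baked into the cluster. A minimal sketch, assuming the standard mapreduce.framework.name switch of Hadoop 2.x; mapper and reducer setup is left at the identity defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserLandMapReduce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 'yarn' routes the job through the MapReduce ApplicationMaster;
    // there is no JobTracker involved.
    conf.set("mapreduce.framework.name", "yarn");

    Job job = Job.getInstance(conf, "pass-through-on-yarn");
    job.setJarByClass(UserLandMapReduce.class);
    // Mapper/Reducer are left as the identity defaults; the point is that the
    // MapReduce library itself is packaged with the job, so different jobs
    // on the same cluster can carry different MapReduce versions.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```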


•  Support for programming paradigms other than MapReduce
  –  MPI
  –  Master-worker
  –  Machine learning
  –  Iterative processing
  –  Enabled by allowing the use of a paradigm-specific ApplicationMaster (see the sketch below)
  –  All run on the same Hadoop cluster
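What enables this is that an ApplicationMaster is just user code negotiating with the ResourceManager. A heavily trimmed sketch of such an AM using the AMRMClient API (names follow later Hadoop 2.x releases; the container count, sizes and polling interval are illustrative):

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MinimalAppMaster {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(new YarnConfiguration());
    rmClient.start();

    // Register with the ResourceManager; from here on the AM owns the
    // application's scheduling logic (MPI ranks, workers, iterations, ...).
    rmClient.registerApplicationMaster("", 0, "");

    // Ask for a handful of 1 GB / 1 core containers.
    Priority priority = Priority.newInstance(0);
    Resource capability = Resource.newInstance(1024, 1);
    for (int i = 0; i < 4; i++) {
      rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
    }

    // Heartbeat loop: allocate() both sends pending requests and fetches grants.
    int granted = 0;
    while (granted < 4) {
      List<Container> allocated = rmClient.allocate(0.1f).getAllocatedContainers();
      granted += allocated.size();
      Thread.sleep(1000);
    }

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", null);
  }
}
```

A real MPI or machine-learning AM would plug its own coordination logic into this loop and start work in the granted containers through the NodeManager client API.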


Any Performance Gains?


• Significant gains across the board, 2x in some cases
• MapReduce
  – Lots of improvements unlocked since the TeraSort record (Owen/Arun, 2009)
  – Shuffle: 30%+ faster
  – Small jobs: Uber AM (the whole job runs inside the ApplicationMaster)
  – Re-use of task slots (containers)

More details: http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/


Testing?


• Testing, *lots* of it
• Benchmarks (every release should be at least as good as the last one)
• Integration testing
  – HBase
  – Pig
  – Hive
  – Oozie
• Functional tests
  – Nightly
  – Over 1,000 functional tests for MapReduce alone
  – Several hundred for Pig, Hive, etc.

• Deployment discipline


Benchmarks


• Benchmark every part of the HDFS & MapReduce pipeline
  – HDFS read/write throughput
  – NameNode operations
  – Scan, shuffle, sort
• GridMix v3
  – Run production traces in test clusters
  – Thousands of jobs
  – Stress mode vs. replay mode


Deployment


• Alpha/test (early UAT), November 2011
  – Small scale (500-800 nodes)
• Alpha, February 2012
  – Majority of users
  – ~1,000 nodes per cluster, > 2,000 nodes in all
• Beta, Q3 2012
  – "Beta" is a misnomer: tens of PB of storage
  – Significantly wider variety of applications and load
  – 4,000+ nodes per cluster, > 15,000 nodes in all
• Production
  – Well, it's production


Resources


hadoop-2.0.2-alpha release: http://hadoop.apache.org/common/releases.html
Release documentation: http://hadoop.apache.org/common/docs/r2.0.2-alpha/
Blogs:
• YARN blog series: http://hortonworks.com/blog/category/apache-hadoop/yarn/
• http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
• http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/


What next?

1. Download Hortonworks Data Platform 2.0 (for preview only): hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more:

Training (hortonworks.com/training)
•  Expert role-based training
•  Courses for admins, developers and operators
•  Certification program
•  Custom onsite options

Hortonworks Support (hortonworks.com/support)
•  Full lifecycle technical support across four service levels
•  Delivered by Apache Hadoop experts/committers
•  Forward-compatible


© Hortonworks Inc. 2012

Questions & Answers

TRY: download at hortonworks.com

LEARN: Hortonworks University

FOLLOW: Twitter @hortonworks, Facebook: facebook.com/hortonworks

MORE EVENTS: hortonworks.com/events


Further questions & comments: [email protected]