Hadoop YARN in the Cloud
Junping Du, Staff Engineer, VMware
China Hadoop Summit, 2013
Agenda
• Hadoop YARN – Hub for Big Data Applications
• YARN and Cloud Computing
• HVE (Hadoop Virtualization Extension) work on YARN
Hadoop MapReduce v1 (Classic)
• JobTracker
– Manages cluster resources and job scheduling
• TaskTracker
– Per-node agent
– Manages tasks
MapReduce v1 Limitations
• Scalability
– JobTracker manages both cluster resources and job scheduling
• SPOF (Single Point of Failure)
– A JobTracker failure causes all queued and running jobs to fail
– Restart is very tricky due to complex state
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
• Lacks wire-compatible protocols
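The "hard partition" point can be illustrated with a small back-of-the-envelope calculation. This is only a sketch: the slot counts and task counts below are made-up numbers, and the two helper functions are hypothetical, not part of Hadoop.

```python
# Toy comparison; all numbers are illustrative assumptions, not measurements.

def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots in use when map and reduce slots are not interchangeable."""
    used = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return used / (map_slots + reduce_slots)


def pooled_utilization(total_slots, map_tasks, reduce_tasks):
    """Fraction of slots in use when any free slot can run any kind of task."""
    used = min(total_slots, map_tasks + reduce_tasks)
    return used / total_slots


# A node with 8 map slots and 4 reduce slots during a map-heavy phase:
# 20 map tasks are waiting and no reduce task is runnable yet.
hard = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=20, reduce_tasks=0)
pooled = pooled_utilization(total_slots=12, map_tasks=20, reduce_tasks=0)
print(f"hard partition: {hard:.2f}, pooled: {pooled:.2f}")
```

With these assumed numbers the four reduce slots sit idle under the hard partition, while a pooled model keeps the whole node busy.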
YARN Architecture
• Splits up the two major functions of the JobTracker
– Resource Manager (RM): cluster resource management
– Application Master (AM): task scheduling and monitoring
• NodeManager (NM): a new per-node slave
– Launches the applications' containers
– Monitors their resource usage (CPU, memory) and reports to the Resource Manager
• YARN maintains compatibility with existing MapReduce applications and supports other applications
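The division of labor above can be sketched as a toy model. The class names mirror the roles on the slide, but every method, field, and the first-fit placement rule is a simplifying assumption and not the real YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: launches containers and tracks local usage."""
    node_id: str
    memory_mb: int                      # memory this node offers to YARN
    used_mb: int = 0
    containers: list = field(default_factory=list)

    def free_mb(self):
        return self.memory_mb - self.used_mb

    def launch(self, app_id, size_mb):
        self.used_mb += size_mb
        self.containers.append((app_id, size_mb))


class ResourceManager:
    """Cluster-wide role: knows node capacities and hands out containers."""

    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def allocate(self, app_id, size_mb):
        # First-fit placement (an assumption; real schedulers are more elaborate).
        for node in self.nodes.values():
            if node.free_mb() >= size_mb:
                node.launch(app_id, size_mb)
                return node.node_id
        return None  # no node can host the container right now


class ApplicationMaster:
    """Per-application role: decides which tasks to run and tracks them."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.placements = {}

    def run_tasks(self, tasks):
        for name, size_mb in tasks:
            self.placements[name] = self.rm.allocate(self.app_id, size_mb)
        return self.placements


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am = ApplicationMaster("app_1", rm)
placements = am.run_tasks(
    [("map_0", 2048), ("map_1", 2048), ("reduce_0", 2048), ("reduce_1", 1024)]
)
print(placements)
```

The point of the sketch is that the Resource Manager never sees "map" or "reduce": it only hands out containers, and all task-level logic lives in the per-application master.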
YARN – Hub for Big Data Applications
[Diagram: YARN on top of HDFS, hosting MapReduce, Tez, Storm, Spark, HBase, Impala, OpenMPI, and Distributed Shell]
• App-specific AM
• HOYA (HBase On YArn)
– Long-running services (YARN-896)
• LLAMA (Low Latency Application MAster)
– Gang scheduler (YARN-624)
YARN and Cloud
• Two different perspectives:
– YARN-centric perspective
• YARN is the key platform for apps
• YARN is independent of infrastructure; running on top of Cloud shows YARN's generality
– Cloud-centric perspective
• YARN is an umbrella for one kind of application
• Supporting YARN shows Cloud's generality
YARN and Cloud: YARN-centric Perspective
[Diagram: Big Data apps (MapReduce, Tez, Storm, Spark, HBase, Impala, Open MPI, Distributed Shell, …) run on YARN; YARN runs on infrastructure, either bare-metal machines or cloud infrastructure (VMware, OpenStack, …)]
YARN and Cloud: Cloud-centric Perspective
[Diagram: Cloud infrastructure (VMware, OpenStack, etc.) hosts YARN apps (MapReduce, Tez, Storm, Spark, HBase, Impala, Open MPI, Distributed Shell, … on YARN) alongside legacy apps and non-YARN Big Data apps]
YARN vs. Cloud
• Similarity
– Both aim to share resources across applications
– Both provide global resource management
• YARN vs. Cloud
– YARN manages resources at the OS layer; Cloud manages resources at the hypervisor layer (not directly comparable, but the hypervisor is more powerful than the OS)
– Apps managed by YARN need a specific AppMaster; apps managed by Cloud run exactly as they would on physical machines (advantage: Cloud)
– YARN tracks application-specific metrics and progress; Cloud tracks only the underlying resources (advantage: YARN)
YARN + Cloud
• Why YARN + Cloud?
– Leverage virtualization for strong isolation, fine-grained resource sharing, and other benefits
– Uniform infrastructure to simplify enterprise IT
• What does it look like?
– YARN NMs run inside VMs managed by the Cloud infrastructure
– A communication channel between the YARN RM and the Cloud Resource Manager handles coordination
• How do we do it?
– The first item above is easy and works smoothly
– The second can be achieved in two ways
• YARN becomes aware of, and can manipulate, Cloud resource changes
• YARN provides a generic resource notification mechanism that the Cloud manager can use when resources change
Elastic YARN Node in the Cloud
• A VM's resource boundary can be elastic
– CPU is easy: time slicing (with constraints)
– Memory is harder: page sharing and memory ballooning
– In case of contention, enforce limits and proportional sharing
– "Stealing" resources behind apps' backs can cause bad performance (paging)
– App-aware resource management could address these issues
• Hadoop YARN resource model
– Dynamic at the cluster level: nodes can be added or removed
– But static per node
• In this case, should we enable resource elasticity on the VM?
– If yes, performance is low when resource contention happens
– If no, utilization is as low as on physical boxes, because free resources cannot be leveraged by other busy VMs
• We need a better answer.
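To make "enforce limits and proportional sharing" concrete, here is a generic weighted-sharing sketch. It is not the algorithm of any particular hypervisor; the function name, the VM names, the share weights, and the memory figures are all assumptions chosen for illustration.

```python
def proportional_share(capacity, demands, shares):
    """Split capacity among VMs by share weight, never giving a VM more than it demands.

    Capacity left over by VMs with small demands is redistributed to the rest.
    """
    alloc = {vm: 0.0 for vm in demands}
    active = set(demands)
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[vm] for vm in active)
        satisfied = set()
        handed_out = 0.0
        for vm in active:
            offer = remaining * shares[vm] / total_shares
            grant = min(offer, demands[vm] - alloc[vm])
            alloc[vm] += grant
            handed_out += grant
            if alloc[vm] >= demands[vm] - 1e-9:
                satisfied.add(vm)
        remaining -= handed_out
        if not satisfied:
            break  # everyone is capped by their share: capacity is exhausted
        active -= satisfied
    return alloc


# Hypothetical host with 16 GB for three VMs; vm_c has twice the share weight.
demands = {"vm_a": 2, "vm_b": 12, "vm_c": 12}
shares = {"vm_a": 1, "vm_b": 1, "vm_c": 2}
contended = proportional_share(16, demands, shares)
relaxed = proportional_share(32, demands, shares)
print(contended)
print(relaxed)
```

Under contention (16 GB) the two busy VMs get less than they ask for, in a 1:2 ratio for the capacity they compete over; with 32 GB every demand is met and the shares do not matter. A NodeManager inside vm_b that still believes it owns 12 GB is exactly the "stealing behind apps" problem described above.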
HVE provides the answer!
• Hadoop Virtualization Extensions
– A project to enhance Hadoop running on virtualization
• Goal: make Hadoop cloud-ready
– Provide virtualization-awareness to Hadoop, i.e. virtual topology, virtual resources, etc.
– Deliver generic utilities that can be leveraged by virtualized platforms
• Independent of virtualization platform and cloud infrastructure
• 100% contributed to the Apache Hadoop community
HVE
• Philosophy
– Make infrastructure-related components abstract
– Deliver different implementations that can be configured properly
• E.g. BlockPlacementPolicy
[Diagram: the concrete BlockPlacementPolicy becomes BlockPlacementPolicy (abstract), with two implementations: BlockPlacementPolicyDefault and a BlockPlacementPolicy for virtualization]
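The "abstract component, configurable implementations" philosophy can be sketched as follows. This is a Python toy, not the HDFS Java classes: the method name `choose_targets`, the node dictionaries, the registry, and both placement rules are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BlockPlacementPolicy(ABC):
    """Abstract component: callers depend only on this interface."""

    @abstractmethod
    def choose_targets(self, nodes, replicas):
        """Return the names of the nodes that should hold the replicas."""


class BlockPlacementPolicyDefault(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Toy rule: treat every node as an independent machine.
        return [n["name"] for n in nodes[:replicas]]


class BlockPlacementPolicyForVirtualization(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Toy rule: never place two replicas on VMs sharing a physical host.
        chosen, hosts_used = [], set()
        for n in nodes:
            if n["host"] not in hosts_used:
                chosen.append(n["name"])
                hosts_used.add(n["host"])
            if len(chosen) == replicas:
                break
        return chosen


# Hypothetical registry standing in for a configuration setting.
POLICIES = {
    "default": BlockPlacementPolicyDefault,
    "virtualization": BlockPlacementPolicyForVirtualization,
}


def load_policy(name):
    return POLICIES[name]()


nodes = [
    {"name": "vm1", "host": "hostA"},
    {"name": "vm2", "host": "hostA"},
    {"name": "vm3", "host": "hostB"},
    {"name": "vm4", "host": "hostC"},
]
default_targets = load_policy("default").choose_targets(nodes, 3)
virtual_targets = load_policy("virtualization").choose_targets(nodes, 3)
print(default_targets, virtual_targets)
```

In this toy setup the default policy puts two replicas on VMs of the same physical host, so one host failure loses two copies; the virtualization-aware implementation avoids that without any change to the calling code.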
Elastic YARN Node in the Cloud
[Diagram: a virtualization host runs a virtual YARN node next to other workloads; the virtual node holds a Datanode (backed by a VMDK) and a NodeManager with containers. Questions posed: add/remove resources? Grow/shrink by tens of GB in memory? Grow/shrink the resources of a VM.]
Implementation – YARN-291 (umbrella)
• YARN-311 – Core scheduler changes
• YARN-312 – AdminProtocol changes
• YARN-313 – CLI
• REST API, JMX, etc.
[Diagram: the Node Manager sends heartbeats to the Resource Tracker Service inside the Resource Manager; other components shown are the Scheduler (SchedulerNode), RMContext (RMNode), AdminService, Admin CLI, and the Cloud Resource Manager. UpdateNodeResource() updates the cluster resource.]
yarn rmadmin -updateNodeResource <NodeId> <Resource>
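The effect of updating a node's resource at runtime can be sketched with a toy scheduler. The class and method names loosely echo the boxes in the diagram but are hypothetical, as are the node name and memory sizes. In particular, letting available memory go negative after a shrink instead of killing containers is an assumption of this sketch, not a statement about the YARN-291 patches.

```python
class SchedulerNode:
    """Scheduler-side view of one node's capacity and running containers."""

    def __init__(self, node_id, total_mb):
        self.node_id = node_id
        self.total_mb = total_mb
        self.running_mb = []            # sizes of running containers

    def available_mb(self):
        return self.total_mb - sum(self.running_mb)


class Scheduler:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, total_mb):
        self.nodes[node_id] = SchedulerNode(node_id, total_mb)

    def update_node_resource(self, node_id, new_total_mb):
        # Assumption: running containers are left alone; only the total changes.
        self.nodes[node_id].total_mb = new_total_mb

    def allocate(self, node_id, size_mb):
        node = self.nodes[node_id]
        if node.available_mb() < size_mb:
            return False
        node.running_mb.append(size_mb)
        return True


sched = Scheduler()
sched.add_node("vm-node-1", 8192)
first = sched.allocate("vm-node-1", 3072)
second = sched.allocate("vm-node-1", 3072)

# The cloud manager shrinks the VM: the node is now overcommitted.
sched.update_node_resource("vm-node-1", 4096)
after_shrink = sched.nodes["vm-node-1"].available_mb()
refused = sched.allocate("vm-node-1", 1024)

# The cloud manager grows the VM: new containers fit again.
sched.update_node_resource("vm-node-1", 16384)
after_grow = sched.nodes["vm-node-1"].available_mb()
accepted = sched.allocate("vm-node-1", 4096)
print(after_shrink, refused, after_grow, accepted)
```

The design choice illustrated here is that per-node capacity becomes a value the scheduler re-reads on every allocation, so an external manager can change it without restarting the node.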
Reference
• YARN MapReduce 2.0
– https://issues.apache.org/jira/browse/MAPREDUCE-279
• HVE topology extension
– https://issues.apache.org/jira/browse/HADOOP-8468
• HVE topology extension for YARN
– https://issues.apache.org/jira/browse/YARN-18
• HVE elastic resource configuration
– https://issues.apache.org/jira/browse/YARN-291
• Gang Scheduling
– https://issues.apache.org/jira/browse/YARN-624
• Long-lived services in YARN
– https://issues.apache.org/jira/browse/YARN-896
Thanks!
Junping Du