ANATOMY OF YARN OR MAPREDUCE 2 IN HADOOP
RAJESH ANANDA KUMAR
RAJESH_1290K@YAHOO.COM
What is YARN?
• YARN stands for Yet Another Resource Negotiator. It is a framework introduced in 2010 by a team at Yahoo! and is considered the next generation of MapReduce. YARN is not specific to MapReduce; it can be used to run any application.
Why YARN?
• In Hadoop clusters with more than about 4,000 nodes, Classic MapReduce hits scalability bottlenecks. This is because the JobTracker does too many things: scheduling jobs, monitoring task progress by keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping (such as managing counter totals).
How YARN solves the problem?
• The problem is solved by splitting the responsibilities of the JobTracker (in Classic MapReduce) across different components. As a result, more entities are involved in YARN than in Classic MR. The entities in YARN are as follows:
• Client: submits the MapReduce job.
• Resource Manager: manages the use of resources across the cluster and allocates new containers for Map and Reduce processes.
• Node Manager: runs on every cluster node and oversees the containers running on that node. It does not matter whether a container was created for a Map, Reduce, or any other process; the Node Manager ensures that the application does not use more resources than it has been allocated.
• Application Master: negotiates with the Resource Manager for resources and runs the application-specific processes (Map or Reduce tasks) in those containers. The Application Master and the MapReduce tasks run in containers that are scheduled by the Resource Manager and managed by the Node Managers.
• HDFS: used to share the job's resources (jar, configuration files, input splits) between the other entities.
How to activate YARN?
• Setting the property ‘mapreduce.framework.name’ to ‘yarn’ activates the YARN framework. From then on, whenever a job is submitted, the YARN framework is used to execute it.
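This property is typically set in mapred-site.xml; a minimal fragment:

```xml
<!-- mapred-site.xml: switch the MapReduce execution framework to YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

The same property can also be set programmatically on the job's Configuration object, or per invocation with -D mapreduce.framework.name=yarn.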
FIRST, A JOB HAS TO BE SUBMITTED TO THE HADOOP CLUSTER. LET'S SEE HOW JOB SUBMISSION HAPPENS IN YARN.
(Diagram: the MR program in the client JVM calls submit() on the Job; the client obtains an ID via getNewApplicationId() from the ResourceManager, copies the job jar, configuration files, and computed input splits into an HDFS folder named after the application ID, and then calls submitApplication() on the Resource Manager node.)
• Job submission in YARN is very similar to Classic MapReduce, except that in YARN it is not called a Job but an Application.
• The client calls the submit() (or waitForCompletion()) method on the Job.
• Job.submit() does the following:
• Retrieves a new application ID from the Resource Manager.
• Checks the input and output specifications.
• Computes the input splits.
• Creates a directory named after the application ID in HDFS.
• Copies the job jar, configuration files, and computed input splits to this directory.
• Informs the Resource Manager by executing submitApplication() on it.
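The submission steps above can be sketched in plain Java. This is a simulation, not the real Hadoop client; the staging-directory layout and the class name are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of the client-side submission steps; the staging path layout
// below is a simplified assumption, not Hadoop's exact directory scheme.
public class SubmissionSketch {

    // Artifacts the client copies into the application's HDFS directory
    // before calling submitApplication() on the Resource Manager.
    static List<String> stagedFiles(String applicationId) {
        String stagingDir = "/staging/" + applicationId;
        List<String> files = new ArrayList<>();
        files.add(stagingDir + "/job.jar");    // the job jar
        files.add(stagingDir + "/job.xml");    // configuration files
        files.add(stagingDir + "/job.split");  // computed input splits
        return files;
    }

    public static void main(String[] args) {
        for (String f : stagedFiles("application_0001")) {
            System.out.println(f);
        }
    }
}
```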
NEXT, THE SUBMITTED JOB WILL BE INITIALIZED. LET'S SEE HOW JOB INITIALIZATION HAPPENS IN YARN.
• The ResourceManager.submitApplication() method hands the job over to the scheduler.
• The scheduler allocates a new container on one of the cluster nodes. The Node Manager on that node manages the actual process that is scheduled to run in the container.
• The actual process in this case is the application master. For an MR job, the main class of the application master is ‘MRAppMaster’.
• MRAppMaster initializes the job by creating a number of bookkeeping objects to track the job's progress as it receives progress and completion reports from the tasks.
• Next, MRAppMaster retrieves the input splits from HDFS.
(Diagram: submitApplication() hands the job to the Scheduler inside the ResourceManager; MRAppMaster keeps bookkeeping info for the map tasks, reduce tasks, and other tasks, and reads the input splits stored in the Job ID directory in HDFS.)
• Now the application master creates one map task per split, checks the ‘mapreduce.job.reduces’ property, and creates that many reduce tasks.
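This sizing step can be sketched as follows; the class and task names are illustrative, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the application master creates one map task per input split and
// as many reduce tasks as 'mapreduce.job.reduces' specifies.
public class TaskCreationSketch {

    static List<String> createTasks(int numSplits, int numReduces) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            tasks.add("map_" + i);      // one map task per split
        }
        for (int i = 0; i < numReduces; i++) {
            tasks.add("reduce_" + i);   // from mapreduce.job.reduces
        }
        return tasks;
    }

    public static void main(String[] args) {
        // 3 input splits, mapreduce.job.reduces = 1  ->  4 tasks
        System.out.println(createTasks(3, 1));
    }
}
```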
(Diagram: the Application Master process, MRAppMaster, running in a container managed by a Node Manager.)
• At this point, the application master knows how big the job is. It decides whether to execute the job in the same JVM as itself, or to run each of the tasks in parallel in separate containers. Jobs small enough to run in the application master's own JVM are said to be uberized, or run as an uber task.
• This decision is made by the application master based on the following property configurations:
• mapreduce.job.ubertask.enable
• mapreduce.job.ubertask.maxmaps
• mapreduce.job.ubertask.maxreduces
• mapreduce.job.ubertask.maxbytes
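The decision can be sketched as a single predicate. The defaults assumed in the example (at most 9 maps, 1 reduce, input no larger than one HDFS block) match common Hadoop defaults, but check them against your version.

```java
// Sketch of the uber-task check: a job runs in the application master's own
// JVM only if it is small on every axis and uberization is enabled.
public class UberDecisionSketch {

    static boolean isUber(boolean enabled, int maps, int reduces,
                          long inputBytes, int maxMaps, int maxReduces,
                          long maxBytes) {
        return enabled
            && maps <= maxMaps          // mapreduce.job.ubertask.maxmaps
            && reduces <= maxReduces    // mapreduce.job.ubertask.maxreduces
            && inputBytes <= maxBytes;  // mapreduce.job.ubertask.maxbytes
    }

    public static void main(String[] args) {
        long oneBlock = 128L * 1024 * 1024;  // assumed HDFS block size
        System.out.println(isUber(true, 5, 1, 10_000L, 9, 1, oneBlock));  // true
        System.out.println(isUber(true, 50, 1, 10_000L, 9, 1, oneBlock)); // false
    }
}
```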
• Before any task can run, the application master executes the job setup methods to create the job's output directory.
IF THE JOB IS NOT RUN AS AN UBER TASK, THE APPLICATION MASTER REQUESTS CONTAINERS FOR ALL MAP AND REDUCE TASKS. THIS PROCESS IS CALLED TASK ASSIGNMENT. LET'S SEE HOW TASK ASSIGNMENT HAPPENS IN YARN.
• The application master sends a heartbeat signal to the Resource Manager every few seconds and uses this signal to request containers for the Map and Reduce tasks.
(Diagram: the Application Master sends a heartbeat signal to the Resource Manager with requests for map and reduce task containers.)
• The container request includes the map task's data locality information, i.e., the host and rack on which its split resides.
• Unlike MR1, which has a fixed number of slots with fixed resource allocations, YARN is flexible in resource allocation. The container request (sent along with the heartbeat signal) can include the amount of memory needed for the task; the default for both map and reduce tasks is 1024 MB.
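A container request can be pictured as a small record. The class below is illustrative, not YARN's actual ResourceRequest API.

```java
// Sketch of what a container request carries: locality hints for the map
// task's split plus the memory needed (1024 MB by default).
public class ContainerRequestSketch {
    final String[] hosts;  // nodes holding replicas of the input split
    final String[] racks;  // racks containing those nodes
    final int memoryMb;    // requested memory for the task

    ContainerRequestSketch(String[] hosts, String[] racks, int memoryMb) {
        this.hosts = hosts;
        this.racks = racks;
        this.memoryMb = memoryMb;
    }

    public static void main(String[] args) {
        ContainerRequestSketch req = new ContainerRequestSketch(
                new String[]{"node3"}, new String[]{"rack1"}, 1024);
        System.out.println(req.memoryMb + " MB, preferred host " + req.hosts[0]);
    }
}
```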
• The scheduler uses this information to make scheduling decisions. It first tries node-local placement; if that is not possible, it tries rack-local placement; otherwise it falls back to non-local placement. Refer: Replica Placement slide.
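The placement preference can be sketched as follows. The rackOf() naming convention ("rackX-nodeY") is a made-up stand-in for real topology resolution.

```java
import java.util.List;

// Sketch of the scheduler's locality preference: node-local placement first,
// then rack-local, then any free node.
public class PlacementSketch {

    // Hypothetical topology: a host named "rack1-node3" is in "rack1".
    static String rackOf(String host) {
        return host.split("-")[0];
    }

    static String place(List<String> splitHosts, List<String> freeNodes) {
        for (String node : freeNodes) {              // 1. node-local
            if (splitHosts.contains(node)) return node;
        }
        for (String node : freeNodes) {              // 2. rack-local
            for (String host : splitHosts) {
                if (rackOf(node).equals(rackOf(host))) return node;
            }
        }
        return freeNodes.get(0);                     // 3. non-local
    }

    public static void main(String[] args) {
        List<String> splitHosts = List.of("rack1-node3", "rack2-node1");
        System.out.println(place(splitHosts, List.of("rack1-node3", "rack3-node9"))); // node-local
        System.out.println(place(splitHosts, List.of("rack1-node7", "rack3-node9"))); // rack-local
        System.out.println(place(splitHosts, List.of("rack3-node9")));                // non-local
    }
}
```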
(Diagram: MRAppMaster running in its container, managed by a Node Manager.)
• Once the Resource Manager receives this request, its scheduler allocates a new container on a suitable node, and the Node Manager on that node launches and manages the Map or Reduce task for which the container was created. The Node Manager also ensures that the requested amount of resources is allocated to the container.
NOW TASKS ARE ASSIGNED TO CONTAINERS, WHICH FOLLOW A SERIES OF STEPS TO EXECUTE A TASK. LET'S SEE HOW TASKS ARE EXECUTED IN A YARN CONTAINER.
(Diagram: resources are localized from the Distributed Cache in HDFS into a folder created in the container's local filesystem.)
• The Application Master starts the task's container through the Node Manager on the node where the container was allocated.
• That Node Manager spawns a new JVM process and launches a Java application called ‘YarnChild’. The reason for a new JVM process is the same as in MR1: isolating the framework from task failures. YARN does not support JVM reuse.
• The work of YarnChild is to execute the actual task (Map or Reduce).
• First, YarnChild localizes the resources the task needs: the job jar, configuration files, and any supporting files from the Distributed Cache.
• Once the resources are localized, YarnChild begins executing the Map or Reduce task.
(Diagram: MRAppMaster in its container; on another node, a Node Manager spawns a JVM process running YarnChild, which un-jars the job jar contents and executes the Map/Reduce task in its container.)
SINCE TASKS ARE EXECUTED IN A DISTRIBUTED ENVIRONMENT, TRACKING THE PROGRESS AND STATUS OF A JOB IS TRICKY. LET'S SEE HOW PROGRESS AND STATUS UPDATES ARE HANDLED IN YARN.
• Each task sends its progress and counters to the Application Master once every three seconds.
• The Application Master aggregates these reports to build the overall job progress.
• The client polls the Application Master every second to receive progress updates. The interval can be configured with the property mapreduce.client.progressmonitor.pollinterval.
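The aggregation step can be sketched as a simple average over per-task progress values; the real computation weights the map and reduce phases, so treat this as an illustration only.

```java
// Sketch: the Application Master folds per-task progress values (0.0..1.0)
// into a single overall job progress figure.
public class ProgressSketch {

    static double overallProgress(double[] taskProgress) {
        if (taskProgress.length == 0) return 0.0;
        double sum = 0.0;
        for (double p : taskProgress) {
            sum += p;
        }
        return sum / taskProgress.length;
    }

    public static void main(String[] args) {
        // one finished task, one half-done, one not started -> 0.5
        System.out.println(overallProgress(new double[]{1.0, 0.5, 0.0}));
    }
}
```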
(Diagram: the MapReduce program on the client node calls getStatus() on the Job; the Map/Reduce tasks in YarnChild JVMs report to MRAppMaster, which returns status such as Job: SFO Crime, Job Status: Running, along with task and task status details.)
• The Resource Manager web UI displays all the running applications, with links to the web UIs of their respective application masters, each of which displays further details of the MR job, including its progress.
THIS EXECUTION PROCESS CONTINUES UNTIL ALL THE TASKS ARE COMPLETED. ONCE THE LAST TASK COMPLETES, THE MR FRAMEWORK ENTERS THE FINAL PHASE, CALLED JOB COMPLETION.
• When the job is completed, the application master and the task containers clean up their working state, and the OutputCommitter's job cleanup method is executed.
• If the property ‘job.end.notification.url’ is set, an HTTP job notification is sent to the configured URL. (In YARN there is no JobTracker; the application master sends the notification.)
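A minimal configuration fragment for this, using the property name as given above; the $jobId/$jobStatus substitution variables are an assumption to verify against your Hadoop version:

```xml
<property>
  <name>job.end.notification.url</name>
  <value>http://example.com/notify?jobid=$jobId&amp;status=$jobStatus</value>
</property>
```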
• Job information is archived by the job history server to enable later interrogation by users if desired.
THE END
PLEASE SEND YOUR VALUABLE FEEDBACK TO [email protected]