Cluster Scheduler

Cluster Scheduler

Reference:Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center NSDI’2011

Multi-agent Cluster Scheduling for Scalability andFlexibility. Berkerly techdoc EECS-2012-273. (doctoral dissertation)

Omega: flexible scalable schedulers for large compute clusters EuroSys’2013

Ytao 2013.5.15

Cluster Scheduler(Intro)

• Cloud computing framework varies.

• New frameworks will likely continue to merge, and no single framework will be optimal for all application.

• Cluster with multiple frameworks improves ultilization and data sharing.

Problem Statement:

• In the face of increasing demand for cluster resources by diverse cluster computing applications and the growing number of machines in typical clusters,it is a challenge to design cluster schedulers that provide flexible, scalable, and effcient resource allocations

Cluster SchedulerMonolithic State Scheduling(MSS)

• Traditional & popular ones: (LSF , condor Hadoop)

• Concept: a single scheduling agent process that makes all scheduling decisions sequentially

• Usage: The agent takes input about framework requirements, resource availability, and organizational policies, and computes a global schedule for all tasks

Cluster SchedulerMonolithic State Scheduling(MSS 2)

• Advantage: optimal scheduling. Global!• Challenge:– Complexity: capture all framework requirements– Scalebility : New frameworks emerge– Lose framework’s own scheduling optimization

Cluster Scheduler (update)

• Scalability. (response time, number of machines)• Flexibility (heterogeneous mix of job)• Usability and Maintainability(easily adapt new

types of jobs, frameworks)• Fault isolation(Minimize dependencies between

unrelated jobs)• Utilization(Achieve high cluster resource

utilization. e.g., cpu utilization, memory utilization)

Partitioned State Scheduling(PSS)

• PSS: in PSS, cluster state is divided between multiple scheduling agents as non-overlapping scheduling domains

• Statically Partitioned State Scheduling (SPS): statically set cluster resources for particular frameworks.

• Dynamically Partitioned State Scheduling(DPS) : Mesos NSDI 2011

Replicated State Scheduling(RSS)

• in RSS scheduling domains may overlap and optimistic consistency control is used to resolve conflicting transactions

• Omega EuroSys 2013

Cluster Environment

• Use of commodity servers• Tens to hundreds of thousands of servers• Heterogeneous resources• Use of commodity networks

Mix workloads

• Service Jobs vs. Terminating Jobs• Service Jobs consist of a set of service tasks that conceptually are intended

to run forever, and these tasks are interacted with by means of request-response interfaces. , e.g., a set of web servers or relational database servers.

• Terminating Jobs, on the other hand, are given a set of inputs, perform some work as a function of those inputs, and are intended terminate eventually (traditional HPC cluster management only considers this)

Mesos• Goal:

– Support and demonstrate multi-agent scheduling– Support fair-sharing meta-scheduling policy– Increase overall cluster utilization– Scale to tens of thousands of machines and hundreds of jobs

• Mesos aims to provide a scalable and resilient core for enabling various frameworks to efficiently share clusters.

• To define a minimal interface that enables efficient resource sharing across frameworks,

• push control of task scheduling and execution to the frameworks.

Mesos (resource allocation)

• Resource offer strategy: spare available resource – Fairness– Priority

• Framework rejects offer if resource allocation is not satisfied. Wait for a good offer.

Mesos(sum)

• take advantage of short tasks to increase cluster utilization

• Two-level :Resource offer and scheduler makes it scalable.

• Good at batch jobs. How about service jobs?

RSS Omega

• One of the major drawbacks of Partitioned State Scheduling is that scheduling domains must be selected before the scheduling agent performs its task-resource assignments, thereby potentially restricting the “goodness of fit” that might be achieved by the scheduling agent in its task-resource assignments.

Role of job manager• 1.If job queue is not empty, remove next job from job queue• 2. Sync: Begin a transaction by synchronizing private cluster state with

common cluster• 3. Schedule: Engage scheduling agent to attempt to create task-resource

assignments for all tasks in job, modifying private cluster state in the process.• 4. Submit: Attempt to commit job transaction (i.e., all task-resource

assignments for the job) from private cluster state back to common cluster state. Job transaction can succeed or fail.

• 5. Record which task-resource assignments were successfully committed to common cluster state.

• 6. If any tasks in job remain unscheduled---either because no suitable resources were found for the task during the “schedule" stage or the task-resource assignment, experienced a---insert job back into job queue to be handled again in a future transaction

Role of meta scheduling agent

• Attempt to execute transactions submitted by job managers according to transaction mode settings

• Detect conflicts according to conflict detection semantics and policies

• Enforce meta-scheduling policies

For each submitted job

• 1. reject task-resource assignments that would violate policies

• 2. reject task-resource assignments that conflict with previously accepted transactions

• 3. reject job all task-resource assignments in a job transaction if using all-or-nothing transaction semantics and at least one task-resource assignment was rejected

Conflict mode

• Machine-granularity:• Reject task-resourse assign only if sequential

number and machine are both conflicted.

• Resource-granularity:• Reject task-resourse assign if cluster state

enter an invalid state(cpu memory)

Transaction Granularity Semantics

• all-or-nothing – if a single task-resource assignment conflicts

thennone of the task-resource assignments in the transaction are applied to common cluster state

• incremental– each task-resource assignment can fail

independent of all others

Omega (sum)

• trade of between whole cluster resource utilization and conflicts.

• By having a view of whole cluster, dynastically assign task number based on global optimal is feasible

• Meta-sheduler needs more modification

• Thanks

Documents

Cluster Scheduler