66
Google Confidential and Proprietary Omega: flexible, scalable schedulers for large compute clusters Malte Schwarzkopf (University of Cambridge Computer Lab) Andy Konwinski (UC Berkeley) Michael Abd-El-Malek (Google) John Wilkes (Google) original: April 17th, 2013 this version: Oct 2013 EuroSys 2013 1

Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Omega: flexible, scalable schedulers for large compute clusters

Malte Schwarzkopf (University of Cambridge Computer Lab)

Andy Konwinski (UC Berkeley)

Michael Abd-El-Malek (Google)

John Wilkes (Google)

original: April 17th, 2013this version: Oct 2013

EuroSys 20131

Page 2: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

http://www.google.com/about/datacenters/inside/locations/

We own and operate data centers around the world

Page 3: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent
Page 4: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent
Page 5: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent
Page 6: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

the scheduling problem

EuroSys 2013

Tasks

Job

Machines

2

Page 7: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

trends observed

Diverse workloads

Increasing cluster sizes

Growing job arrival rates

EuroSys 20133

Page 8: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

why is this a problem?

EuroSys 2013

Cluster machines (10,000s)

Cluster scheduler

Arriving jobs and tasks (1,000s)

scheduling logic

4

60+ seconds!

Page 9: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

why is this a problem?

EuroSys 2013

Cluster machines(10,000s)

Cluster scheduler

Arriving jobs and tasks (1,000s)

scheduling logic

Hence:Break up into independent schedulers.

Increasing complexity!

But: How do we arbitrate resources between schedulers?

Page 10: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

existing approaches

static partitioning

● poor utilization● inflexible

S0 S1 S2

monolithic scheduler

SCHEDULER

● hard to diversify● code growth● scalability bottleneck

EuroSys 20136

Page 11: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

existing approaches

EuroSys 2013

S0 S1 S2

CLUSTER STATE

shared-state

e.g. UCB Mesos [NSDI 2011]

● hoarding● information hiding

S0 S1 S2

RESOURCE MANAGER

two-level

7

Page 12: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

how does omega work?

EuroSys 2013

S0 S1

8

Page 13: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary9

how does omega work?

EuroSys 2013

S0 S1

Page 14: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

how does omega work?

EuroSys 2013

S0 S1

Page 15: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary11

how does omega work?

EuroSys 2013

S0 S1

Conflict!

Page 16: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

how does omega work?

EuroSys 2013

S0 S1

failure! success!

12

Page 17: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

overview

1) intro & motivation

2) workload characterization

3) comparison of approaches

4) trace-based simulation

5) flexibility case study

EuroSys 201313

Page 18: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: batch/service split

Batch Service

EuroSys 201314

Page 19: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: batch/service split

Jobs/tasks: countsCPU/RAM: resource seconds [i.e. resource job runtime in sec.]

Cluster AMedium sizeMedium utilization

Cluster BLarge sizeMedium utilization

Cluster CMedium (12k mach.)High utilizationPublic trace

EuroSys 201315

Page 20: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: batch/service split

Jobs/tasks: countsCPU/RAM: resource seconds [i.e. resource job runtime in sec.]

Cluster AMedium size

Medium utilization

Cluster BLarge size

Medium utilization

Cluster CMedium size

High utilization

Public trace

EuroSys 2013

TAKEAWAY

Most jobs are batch, but most resources are consumed by service jobs.

16

Page 21: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: job runtime distributions

BatchService

EuroSys 2013

Frac

tion

of jo

bs ru

nnin

g fo

r les

s th

an X

Page 22: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: inter-arrival time distributions

ServiceBatch

Frac

tion

of in

ter-

arriv

al g

aps

less

than

X

EuroSys 2013

BatchService

Page 23: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: batch/service split

EuroSys 2013

Batch jobs Service jobs80th %ile runtime

80th %ile inter-arrival time

12-20 min. 29 days

4-7 sec. 2-15 min.

17

Page 24: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

overview

1) intro & motivation

2) workload characterization

3) comparison of approaches

4) trace-based simulation

5) flexibility case study

EuroSys 201318

Page 25: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

methodology: simulation

simulation using

empirical workload parameters distributions

EuroSys 2013

Code available:

http://code.google.com/p/cluster-scheduler-simulator

19

Page 26: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

parameters

Scheduler decision time

n: num. tasks

decision time

EuroSys 2013

ttask: per-task(usually 0.005s

per task)

(usually 0.1s per job)

20

Page 27: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

scheduling policies

EuroSys 201348

Why might scheduling take 60 seconds?

● Large jobs (tens of thousands of tasks)

● Optimization algorithms (constraints, bin packing with knock-on preemption)

● Picky jobs in a full cluster

● Monte Carlo simulations (fault tolerance)

Page 28: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Open issues: failure toleranceTopology-aware scheduling for concurrent outages● a fault tree

Page 29: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Open issues: failure toleranceTopology-aware schedulingfor concurrent outages● a fault tree● a fault DAG

Page 30: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Open issues: failure toleranceTopology-aware schedulingfor concurrent outages● a fault tree● a partially redundant

fault DAG

Page 31: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Open issues: failure tolerance● real fault, or

lost touch?● time to detect vs.

false positives?● multiple

information sources for correlated failures?

Page 32: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

How do does the shared-state design compare with other architectures?

EuroSys 2013

Experiment details:● all clusters, 7 simulated days● 2 schedulers● varying Service scheduler

Experiment 1:

21

Page 33: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

t job for ALL

jobsttask for ALL jobs

schedulerbusyness

monolithic, uniform decision time (single logic)

time spent scheduling total time

blue => all jobs were scheduled

red => unscheduled jobs remained

22

Page 34: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

t job for se

rvice jobs

ttask for service jobs

schedulerbusyness

monolithic, fast-path batch decision time

head-of-lineblocking

23

Page 35: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

t job for se

rvice jobs

ttask for service jobs

schedulerbusyness

mesos v0.9 (of May 2012)

Ooops...

24

Page 36: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

3. Blue receives tiny offer.

EuroSys 2013

mesos

S1 S2

1. Green receives offer of all available resources.

2. Blue's task finishes.

4. Blue cannot use it.

[repeat many times]

5. Green finishes scheduling.

6. Blue receives large offer.

By now, it has given up.

RESOURCE MANAGER

25

Page 37: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

t job for se

rvice jobs

ttask for service jobs

schedulerbusyness

omega, no optimizations

26

Page 38: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

t job for se

rvice jobs

ttask for service jobs

schedulerbusyness

omega, optimized

27

Page 39: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

omega, optimized

TAKEAWAY

The Omega shared-state model performs as well as a (complex) monolithic multi-path

scheduler.

Monolithic Mesos Omega

28

Page 40: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Does the shared-state design scale to many schedulers?

EuroSys 2013

Experiment details:● cluster B, 7 simulated days● 2 schedulers● varying job arrival rate and number of schedulers

Experiment 2:

29

Page 41: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

scaling to many schedulers

30

Page 42: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

overview

1) intro & motivation

2) workload characterization

3) comparison of approaches

4) trace-based simulation

5) flexibility case study

EuroSys 201331

Page 43: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

simulator comparison

homogeneousmachines

lightweight simulator

high-fidelity simulator

job parameters empirical distribution

real-world

workload trace

constraints

scheduling algorithm

runtime

not supported supported

random first fitGoogle

algorithm

fast (24h ≃ 5min) slow (24h ≃ 2h)

32

Page 44: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Experiment details:● cluster C, 29 days● 2 schedulers,

non-uniform decision time● varying Service scheduler

EuroSys 2013

Experiment 3:

How much scheduler interference do we see with real Google workloads?

33

Page 45: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

conflict fraction

EuroSys 2013

num. conflictstotal num. transactions

Page 46: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

scheduler busyness

overhead due to conflicts

EuroSys 201334

sche

dule

r bus

ynes

s

Page 47: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

scheduler busyness

overhead due to conflicts

EuroSys 2013

TAKEAWAY

Interference is higher for real-world settings.

35

Page 48: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

1. Fine-grained conflict detection

optimizations

2. Incremental commits

EuroSys 201336

1st

2nd

#89693

69

Page 49: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Experiment details:● cluster C, 29 days● 2 schedulers,

non-uniform decision time● varying Service scheduler

EuroSys 2013

Experiment 4:

How do the optimizations affect performance?

37

Page 50: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

impact on scheduler utilization

EuroSys 201338

Page 51: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

practical implications – scheduler utilization

overhead due to conflicts

EuroSys 201339

sche

dule

r bus

ynes

s

Page 52: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

practical implications – scheduler utilization

overhead due to conflicts

EuroSys 2013

TAKEAWAY

We can make simple improvements that significantly improve scalability.

40

Page 53: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Case study

MapReduce scheduler with opportunistic extra resources

EuroSys 201341

Page 54: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

11

50

200

1000

8000

100 450

3

5

workers in MR jobs

Number of workers [log10]Snapshot over 29 days

Cou

nt o

f job

s w

ith X

wor

kers

Page 55: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

case study: a MapReduce scheduler

EuroSys 2013cluster C, 29 days

Relative speedup [log10]

Frac

tion

of M

R jo

bs w

ith s

peed

up <

X

60% of MapReduces

43

3-4x speedup!

better

Page 56: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

case study: a MapReduce scheduler

EuroSys 2013cluster C, 29 days

Relative speedup [log10]

Frac

tion

of M

R jo

bs w

ith s

peed

up <

X

60% of MapReduces

3-4x speedup!

TAKEAWAY

The Omega approach gives us the flexibility to easily support custom policies.

44

better

Page 57: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

conclusion

TAKEAWAYS

Flexibility and scale require parallelism,

parallel scheduling works if you do it right, and

using shared state is the way to do it right!

EuroSys 201345

Page 58: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

BACKUP SLIDES

EuroSys 2013

Page 59: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and ProprietaryEuroSys 2013

centralized resource-allocator (not fault-tolerant)

per-job “application manager”(MapReduce calls it a “controller”)

YARN

Apache Hadoop YARN: Yet Another Resource Negotiator. ACM SoCC’13

Page 60: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

methodology: simulation

empiricaldistribution

Event-driven simulator

...

Batch

Service

MapReduce

Workload

Experiment configuration

Cluster state

Initial cluster state

EuroSys 2013

Code available:

http://code.google.com/p/cluster-scheduler-simulator

19

Page 61: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

Frac

tion

of jo

bs ru

nnin

g fo

r les

s th

an X

workload: job runtime distributions

BatchService

EuroSys 2013

TAKEAWAY

Service jobs, once scheduled, run for much longer than batch jobs do.

Page 62: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

workload: inter-arrival time distributions

ServiceBatch

Frac

tion

of in

ter-

arriv

al g

aps

less

than

X

EuroSys 2013

TAKEAWAY

Service jobs arrive much less frequently than batch jobs do.

Page 63: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

CLUSTER STATE

Shared state

the omega approach

Optimistic concurrency

S0 S1 S2● Deltas against shared state

● Easy to develop & maintain

● Heterogeneous schedulers OK

● No explicit coordination required

● Interference resolution (not prevention)

● Scales wellEuroSys 2013

Page 64: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

impact on conflict fraction

EuroSys 2013

Page 65: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

case study: a MapReduce scheduler

EuroSys 2013

50% of MapReduces

4.5x speedup

Relative speedup [log10]

cluster A, 29 days

Frac

tion

of M

R jo

bs w

ith s

peed

up <

X

Page 66: Omega: flexible, scalable schedulers for large compute ... · Cluster machines (10,000s) Cluster scheduler Arriving jobs and tasks (1,000s) scheduling logic Hence: Break up into independent

Google Confidential and Proprietary

caveats, or when this won't work well

● aggressive, systematically adverse workloads or schedulers

● small clusters with high overcommit

EuroSys 2013

deal with using out-of-band or post-facto enforcement mechanisms

Possible problems...