Azure Cluster Scheduling at Microsoft Scale MIT, 10/31/2019 Computer Networks (6.829) Konstantinos Karanasos

Cluster Scheduling at Microsoft Scale (web.mit.edu/6.829/www/currentsemester/materials/2019-10-31_RM-MIT_share.pdf), Oct 31, 2019

Azure

Cluster Scheduling at Microsoft Scale

MIT, 10/31/2019

Computer Networks (6.829)

Konstantinos Karanasos

GSL (CISL)

Mission: applied research lab working on systems for big data, cloud, and machine learning

Office of the CTO for Azure Data

~15 researchers/engineers in the Bay Area, Redmond, and Madison

The three hats we wear

Applied research group

Collaborating with database, big data, and AI infra groups at MS

Open-sourcing our code: Apache Hadoop, ONNX Runtime, MLflow, REEF, Heron

Current Focus Areas

Resource management

Systems for ML

ML for Systems

Query Optimization

Provenance

What is a Resource Manager?

[Diagram labels: three Node Managers; steps 1. Request, 2. Allocation, 3. Start container]

Applications acquire cluster resources in the form of containers

Do we really need a Resource Manager?

Lessons learned: Abstracting out the RM layer

[Diagram: the Hadoop 1 World and Hadoop 2 World software stacks, side by side]

Layers (top to bottom): Users, Application Frameworks, Programming Model(s), Cluster OS (Resource Management), File System, Hardware

Hadoop 1 World (monolithic): ad-hoc apps and Hive / Pig on Hadoop 1.x (MapReduce), MR v1, HDFS 1

Hadoop 2 World (layering abstractions): ad-hoc apps and Hive / Pig on MR v2, Tez, Giraph, Storm, Dryad, REEF, Spark, Heron, Scope on YARN, ...; YARN; HDFS 2

Reuse of RM component

Cosmos: Microsoft’s Analytics Stack

Hydra

The Scale and Utilization Challenge

Cluster(s): >50K nodes

Job(s): >2M tasks, >5PB input

Tasks: 10 sec 50th %ile

Utilization: ~60% avg CPU util

Scheduler: >70K QPS

Scheduling in Analytics Clusters: a Journey…

[Timeline: 2003, 2008, 2013, 2019]

Hadoop MR, centralized sched.

YARN (MR, Tez, Spark): +multi-framework, +security, +scheduler expressivity [socc13]

Scope, centralized sched. [eurosys07, vldb08]

Scope (Scope1, Scope2, Scope3), distributed sched.: +tooling/optimizer, +scale, +high utilization [osdi14]

Hydra

Teaser…

>99% tenants migrated

>250K servers

>500K daily jobs

>1 Zettabyte data processed

>1 Trillion tasks scheduled

Agenda

Overview of legacy systems (Apollo, YARN)

Scale: Federated YARN

Resource utilization: Opportunistic Containers

Production Experience

Distributed Scheduling: Legacy Cosmos (Apollo)

Legacy Cosmos (Apollo)

[Diagram labels: Script, Cosmos Job Service, Compile/Optimize, Execution plan, Admission Control (gang-queue), JM1, JM2, JM3]

Script:

T = SELECT * FROM blah;

SELECT T.blah FROM T;

Distributed scheduling + node-level queuing

Distributed scheduling got us scale

* JM: Job Manager

Legacy Cosmos

Problem: (required) static resource allocation yields low utilization

resource fragmentation within and across queues

Solution: opportunistically share resources

“guaranteed” containers

“opportunistic” containers

opportunistic scheduling got us utilization

Legacy Cosmos: Pros/Cons

+ Great scalability

+ Good resource utilization

- No support for multiple frameworks

- Limited control on scheduling (e.g., fairness, load-balancing, locality, special HW)

Centralized Scheduling: Apache Hadoop YARN

Legacy YARN

[Diagram labels: RM, AM, NM, NM, NM]

1) Submit job

2) Admission control (fairness, quotas, constraints, SLOs, …)

3) Schedule Application Master (AM)

4) Start AM container

Legacy YARN

[Diagram labels: RM, AM, NM, NM, Task]

4) AM requests more containers on heartbeat

5) RM grants “token”

6) Start container

7) AM-Task communication

Legacy YARN: Pros/Cons

+ Support for arbitrary frameworks

+ Rich scheduling invariants

+ Non-tech advantage: OSS/mind-share

- Scale limits (~5K nodes in 2013)

- Utilization: heartbeats lead to idle resources

Main challenges

High resource utilization: Mercury [atc15, eurosys16]

Scalability: Federation [nsdi19]

Rich placement constraints: Medea [eurosys18]

Production jobs and predictability: Morpheus [osdi16]

OSS + production!

OSS

Improving YARN’s scalability: Federated Architecture

Need scale and utilization? Go distributed!

Need scheduling control and multi-framework? Go centralized!

Want it all?

Hydra Architecture

[Diagram labels: sub-cluster 1 and sub-cluster 2, each with an RM; Routers; StateStoreProxy instances; StateStore]

Hydra Architecture

[Diagram labels: sub-cluster 1 and sub-cluster 2, each with an RM; Routers; StateStoreProxy instances; StateStore; NM]

Hydra Architecture

[Diagram labels: sub-cluster 1 and sub-cluster 2, each with an RM; Routers; StateStoreProxy instances; StateStore; AMRMProxy; AM; NM, NM; Task]

Hydra Architecture

[Diagram labels: sub-cluster 1 and sub-cluster 2, each with an RM; Routers; StateStoreProxy instances; StateStore; Global Policy Generator; AMRMProxy; AM; NM]

Scheduling Desiderata: Global Goals

High utilization

Scheduling invariants (e.g., fairness)

Locality (e.g., machine preferences)

[Diagram labels: root queue with child queues QA and QB at 50% each; demand: 12, demand: 4; demand: 12, demand: 12]

✓ Locality, Fairness ✗ Utilization

✓ Utilization, Fairness ✗ Locality

✓ Utilization, Locality ✗ Fairness

✓ Utilization ✓ Fairness ✓ Locality

Policies

AMRMProxy routing of requests: enforce locality?

Per-cluster RM scheduling decisions: enforce quotas?

Key Idea

Decouple:

Share determination: How many resources should a queue get?

Placement: On which machines should each task run?
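The decoupling can be illustrated with a minimal sketch: one function decides how many containers each queue gets, and a separate function decides where they run. The function names, the weighted max-min (water-filling) rule, the total capacity of 16, and the node names are all assumptions made for illustration; the slides do not state the policy that Hydra actually uses.

```python
def determine_shares(capacity, weights, demands):
    """Share determination: weighted max-min fair split of `capacity`."""
    shares = {q: 0.0 for q in demands}
    active = {q for q in demands if demands[q] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[q] for q in active)
        # Queues whose unmet demand fits within their weighted slice.
        satisfied = {
            q for q in active
            if demands[q] - shares[q] <= remaining * weights[q] / total_w + 1e-9
        }
        if not satisfied:
            # Everyone wants more than a fair slice: split what is left by weight.
            for q in active:
                shares[q] += remaining * weights[q] / total_w
            break
        for q in satisfied:
            remaining -= demands[q] - shares[q]
            shares[q] = float(demands[q])
        active -= satisfied
    return shares


def place(share, preferred, free_slots):
    """Placement: fill preferred nodes first, then any node with free slots."""
    placement = []
    order = list(preferred) + [n for n in free_slots if n not in preferred]
    for node in order:
        while len(placement) < share and free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            placement.append(node)
    return placement


weights = {"QA": 0.5, "QB": 0.5}  # the 50% / 50% queues
uneven = determine_shares(16, weights, {"QA": 12, "QB": 4})
even = determine_shares(16, weights, {"QA": 12, "QB": 12})
```

Because the share is computed without reference to machines, the placement step is free to honor locality preferences without changing how much each queue receives.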

Proposed Solution*

[Diagram labels: sub-cluster 1 and sub-cluster 2, each with an RM; StateStoreProxy instances; StateStore; Global Policy Generator; numbered steps 1, 2, 3]

* More advanced than what is in prod. (details in paper)

Handling GPG downtime

If the GPG is down, we fall back to local decisions

Problematic if they “diverge” too much from the global one

Leverage LP-based “tuning” of local queue allocation

Historical demand as a predictor of future demand

Improving YARN’s Utilization: Opportunistic Containers

Suboptimal Utilization in Scope-on-YARN

[Diagram labels: RM, NM1, NM2, j1, j2]

Due to:

Gang scheduling

Feedback delays

Task duration    5 sec     10 sec    50 sec    Mixed-5-50
Utilization      60.59%    78.35%    92.38%    78.54%
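One back-of-the-envelope way to read these figures is to assume that every task slot sits idle for a fixed gap between consecutive tasks, so that utilization is busy time over busy plus idle time. That model is an assumption made here for illustration and is not stated in the slides; the helper name `implied_idle_gap` is likewise hypothetical. Under it, the idle gap implied by each reported utilization comes out on the order of a few seconds, which is why short tasks suffer most.

```python
def implied_idle_gap(task_secs, utilization):
    """Idle seconds per task implied by utilization = busy / (busy + idle)."""
    return task_secs * (1.0 - utilization) / utilization


# Task duration in seconds -> reported utilization (fixed-duration workloads only).
reported = {5: 0.6059, 10: 0.7835, 50: 0.9238}
gaps = {d: implied_idle_gap(d, u) for d, u in reported.items()}

for duration, gap in gaps.items():
    print(f"{duration:>2} sec tasks: ~{gap:.1f} sec idle between tasks")
```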

Opportunistic Containers in YARN

[Diagram labels: RM, AM, NM, NM, Task]

Mask feedback delays

Only OPP containers can be queued

Promotion/demotion of containers
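A minimal sketch of node-side queuing under these rules follows. The `NodeManager` class here is a hypothetical toy, not YARN's actual NodeManager API, and slot counting stands in for real resource accounting. Having a guaranteed container preempt a running opportunistic one when the node is full is an assumption of this sketch; the slide itself only says that opportunistic containers alone may be queued.

```python
from collections import deque


class NodeManager:
    """Toy node with a fixed number of slots and a queue for OPP containers."""

    def __init__(self, slots):
        self.slots = slots
        self.running = {}     # container id -> "GUARANTEED" or "OPPORTUNISTIC"
        self.queue = deque()  # ids of queued opportunistic containers

    def start(self, cid, kind):
        if len(self.running) < self.slots:
            self.running[cid] = kind
            return "RUNNING"
        if kind == "OPPORTUNISTIC":
            # Queued work is ready to start the moment a slot frees up.
            self.queue.append(cid)
            return "QUEUED"
        # Assumption: a guaranteed container evicts a running opportunistic one.
        victim = next(
            (c for c, k in self.running.items() if k == "OPPORTUNISTIC"), None
        )
        if victim is None:
            raise RuntimeError("no room for a guaranteed container")
        del self.running[victim]
        self.running[cid] = kind
        return "RUNNING"

    def finish(self, cid):
        """Free a slot and immediately start the next queued container, if any."""
        del self.running[cid]
        if self.queue:
            nxt = self.queue.popleft()
            self.running[nxt] = "OPPORTUNISTIC"
            return nxt
        return None

    def promote(self, cid):
        if self.running.get(cid) == "OPPORTUNISTIC":
            self.running[cid] = "GUARANTEED"
```

The point of the queue is visible in `finish`: the next task starts locally, without waiting for a round trip to the RM.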

Utilization Gains with Node-Side Queuing

But are long queues all we need?

Job Completion Times with Node-Side Queuing

Naïve node-side queuing can be detrimental for job completion times

Proper queue management techniques are required

Problems with Node-Side Queuing

Load imbalance across nodes
Suboptimal task placement

Head-of-line blocking
Especially for heterogeneous tasks

Early binding of tasks to nodes

[Diagram labels: N1, N2, N3]

Queue management techniques

Place tasks to node queues

Prioritize task execution (queue reordering)

Bound queue lengths

Placement of Tasks to Queues

Placement based on queue length
Agnostic of task characteristics
Suboptimal placement for heterogeneous workloads

Placement based on queue wait time
Better for heterogeneous workloads
Requires task duration estimates

[Diagram labels: RM; N1, N2, N3; 24 100; 253]
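The difference between the two placement rules can be seen in a small sketch. The queue contents below are hypothetical task-duration estimates in seconds chosen for illustration, and the function names are not from any real scheduler.

```python
def pick_by_queue_length(queues):
    """Choose the node with the fewest queued tasks, ignoring their durations."""
    return min(queues, key=lambda node: len(queues[node]))


def pick_by_wait_time(queues):
    """Choose the node whose queued tasks add up to the shortest estimated wait."""
    return min(queues, key=lambda node: sum(queues[node]))


# Hypothetical node queues: estimated duration (secs) of each queued task.
queues = {"N1": [2, 4], "N2": [100], "N3": [2, 5, 3]}

by_length = pick_by_queue_length(queues)  # shortest queue
by_wait = pick_by_wait_time(queues)       # shortest estimated wait
```

With these numbers the length-based rule sends the task to N2, behind a single 100-second task, while the wait-time rule picks N1, where the estimated wait is 6 seconds. The second rule is only as good as the duration estimates it is given.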

Task Prioritization

Queue reordering strategies
Shortest Remaining Job First (SRJF)
Least Remaining Tasks First (LRTF)
Shortest Task First (STF)
Earliest Job First (EJF)

SRJF and LRTF are job-aware
Dynamically reorder tasks based on job progress

Starvation freedom
Give priority to tasks waiting more than X secs

[Diagram labels: RM; N1, N2, N3; j1: 21 tasks, j2: 5 tasks, j3: 9 tasks]
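The four strategies differ only in the sort key applied to a node's queue, which the sketch below makes explicit. The `Task` and `JobProgress` types, the field names, and the rule that starving tasks are served oldest-first are assumptions made for illustration. The job sizes reuse the 21, 5, and 9 task counts above; every other number is invented.

```python
from dataclasses import dataclass


@dataclass
class Task:
    job: str
    duration: float     # estimated task duration (secs)
    enqueued_at: float  # time the task entered the node queue


@dataclass
class JobProgress:
    submitted_at: float
    remaining_tasks: int
    remaining_work: float  # sum of estimated durations of unfinished tasks


STRATEGIES = {
    "SRJF": lambda t, jobs: jobs[t.job].remaining_work,   # job-aware
    "LRTF": lambda t, jobs: jobs[t.job].remaining_tasks,  # job-aware
    "STF": lambda t, jobs: t.duration,
    "EJF": lambda t, jobs: jobs[t.job].submitted_at,
}


def reorder(queue, jobs, strategy, now, max_wait):
    """Return the queue in execution order; starving tasks always go first."""
    key = STRATEGIES[strategy]

    def rank(task):
        starving = (now - task.enqueued_at) > max_wait
        # False sorts before True, so starving tasks lead, oldest first.
        return (not starving, task.enqueued_at if starving else key(task, jobs))

    return sorted(queue, key=rank)


jobs = {
    "j1": JobProgress(submitted_at=0, remaining_tasks=21, remaining_work=210),
    "j2": JobProgress(submitted_at=1, remaining_tasks=5, remaining_work=50),
    "j3": JobProgress(submitted_at=2, remaining_tasks=9, remaining_work=90),
}
queue = [Task("j1", 10, 100), Task("j3", 10, 100), Task("j2", 10, 100)]
lrtf_order = [t.job for t in reorder(queue, jobs, "LRTF", now=105, max_wait=60)]
```

Because the job-aware keys read `jobs` at sort time, re-running `reorder` as jobs make progress yields the dynamic reordering mentioned above.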

Bounding Queue Lengths

Determine max number of tasks at a queue
Trade-off between short and long queues

Short queues
Resource idling → lower throughput

Long queues
High queuing delays, early binding of tasks to queues → longer job completion times

Static and dynamic queue bounding
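As a rough illustration of the two flavors of bounding, the sketch below contrasts a fixed limit with one derived from recent task durations. The dynamic rule shown, sizing the queue so that the expected wait stays near a target, is a hypothetical heuristic chosen for illustration and not necessarily the one used in production; all names and default values are assumptions.

```python
def static_bound(max_tasks):
    """Static bounding: the same fixed limit regardless of the workload."""
    return max_tasks


def dynamic_bound(recent_durations, slots, target_wait, floor=1, cap=64):
    """Dynamic bounding: queue length that keeps expected wait near target_wait."""
    mean = sum(recent_durations) / len(recent_durations)
    # With `slots` tasks running in parallel, roughly one queued task
    # starts every mean / slots seconds.
    bound = int(target_wait * slots / mean)
    return max(floor, min(cap, bound))


def try_enqueue(queue, task, bound):
    """Admit a task only if the queue is below its bound."""
    if len(queue) >= bound:
        return False  # the scheduler should place the task elsewhere
    queue.append(task)
    return True


short_tasks = dynamic_bound([5, 5, 5], slots=4, target_wait=10)
long_tasks = dynamic_bound([50, 50], slots=4, target_wait=10)
```

Short tasks get a longer queue, which keeps the node busy, while long tasks get a short one, which limits queuing delay and early binding.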

Queuing Techniques: Evaluation

1.7x improvement in median JCT over YARN

Both bounding and reordering are crucial

Production Experience

Workload

Scale

High scheduling rate at low allocation latency

Utilization

Federated design improves load balancing, while retaining utilization

Task Performance (comparison)

YARN

Performance

Jobs perform just as well (and tasks are as efficient)

Qualitative Experience

operability

Recap

Overview of legacy systems (Apollo, YARN)

Scale: Federated YARN

Resource utilization: Opportunistic Containers

Production Experience

Conclusion

Research in Industry is a lot of fun

Fewer bigger projects, with massive wins

Big force multipliers

Failure is expected (otherwise we are not trying hard enough)

Modus Operandi

Be picky in choosing problems

Engage early, engage deep

Be inclusive with prod/OSS counterparts