Learning Scheduling Algorithms for Data Processing Clusters

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh



Page 1: Learning Scheduling Algorithms for Data Processing Clusters

Learning Scheduling Algorithms for Data Processing Clusters

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh

Page 2

Motivation

Scheduling is a fundamental task in computer systems:
• Cluster management (e.g., Kubernetes, Mesos, Borg)
• Data analytics frameworks (e.g., Spark, Hadoop)
• Machine learning (e.g., TensorFlow)

An efficient scheduler matters for large datacenters:
• Small improvements can save millions of dollars at scale


Page 3

Designing Optimal Schedulers is Intractable

Must consider many factors for optimal performance:
• Job dependency structure
• Modeling complexity
• Placement constraints
• Data locality
• …

Graphene [OSDI '16], Carbyne [OSDI '16], Tetris [SIGCOMM '14], Jockey [EuroSys '12], TetriSched [EuroSys '16], device placement [NIPS '17], Delay Scheduling [EuroSys '10], …

Practical deployment:
• Ignore complexity → resort to simple heuristics
• Sophisticated system → complex configurations and tuning

No “one-size-fits-all” solution:
The best algorithm depends on the specific workload and system

Page 4

Can machine learning help tame the complexity of efficient schedulers for data processing jobs?

Page 5

Decima: A Learned Cluster Scheduler


• Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Figure: a job DAG (jobs 2 and 3 shown) feeds the scheduler, which assigns work to executors 1…m. “Stages” are identical tasks that can run in parallel; DAG edges are data dependencies.]

Page 6

Decima: A Learned Cluster Scheduler


• Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Figure: Job 1’s DAG feeds the scheduler, which assigns work to servers 1…m.]

Page 7

Design overview

[Figure: RL loop. The environment (job DAGs 1…n running on executors 1…m) emits a state, an observation of jobs and cluster status. The scheduling agent’s policy network, built on a graph neural network, selects schedulable nodes; the scheduling objective defines the reward.]

Page 8

Demo

[Figure: animation of the number of servers working on each job over time.]

Scheduling policy: FIFO
Average job completion time: 225 sec

Page 9

Scheduling policy: Shortest-Job-First
Average job completion time: 135 sec

Page 10

Scheduling policy: Fair
Average job completion time: 120 sec

Page 11

Shortest-Job-First: average job completion time 135 sec
Fair: average job completion time 120 sec
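The gap between these policies can be reproduced with simple arithmetic. Below is a minimal sketch (with hypothetical job durations, not the demo's actual workload) of average job completion time (JCT) under FIFO vs. Shortest-Job-First on a single server:

```python
# Minimal sketch with hypothetical numbers: average JCT for FIFO vs.
# Shortest-Job-First on one server, all jobs arriving at time 0.

def avg_jct(durations):
    """Run jobs back-to-back in the given order; a job's JCT is its finish time."""
    t, total = 0.0, 0.0
    for d in durations:
        t += d
        total += t
    return total / len(durations)

jobs = [10.0, 2.0, 6.0, 4.0]   # hypothetical job durations (sec)
fifo = avg_jct(jobs)           # run in arrival order
sjf = avg_jct(sorted(jobs))    # shortest job first

print(fifo, sjf)  # 15.5 10.5
```

Running short jobs first always weakly improves average JCT, which is why SJF beats FIFO in the demo; Fair and Decima improve further by exploiting parallelism and DAG structure.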

Page 12

Scheduling policy: Decima
Average job completion time: 98 sec

Page 13

Decima: average job completion time 98 sec
Fair: average job completion time 120 sec

Page 14

Contributions


1. First RL-based scheduler for complex data processing jobs
2. Scalable graph neural network to express scheduling policies
3. New learning methods that enable training with online job arrivals


Page 15

Encode scheduling decisions as actions

[Figure: job DAGs 1…n and servers 1…m, a set of identical free executors.]

Page 16

Option 1: Assign all Executors in 1 Action

Problem: huge action space


Page 17

Option 2: Assign One Executor Per Action

Problem: long action sequences


Page 18

Decima: Assign Groups of Executors per Action

[Figure: job DAGs 1…n and servers 1…m; the action assigns “use 3 servers” to one node and “use 1 server” to each of two others.]

Action = (node, parallelism limit)

Page 19

Process Job Information

[Figure: an arbitrary number of job DAGs 1…n.]

Node features:
• number of tasks
• average task duration
• number of servers currently assigned to the node
• are free servers local to this job?
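The feature list above can be sketched as a fixed-length vector per node (the function name and scaling are my own illustration, not Decima's implementation), so that any number of jobs becomes a variable-size set of identical feature rows:

```python
# Illustrative sketch: pack the slide's per-node features into one
# fixed-length row. A DAG with k nodes then yields a k x 4 feature matrix,
# regardless of how many jobs are in the system.
def node_features(num_tasks, avg_task_duration, servers_assigned, free_servers_local):
    return [
        float(num_tasks),            # number of tasks in this stage
        float(avg_task_duration),    # average task duration (sec)
        float(servers_assigned),     # servers currently assigned to the node
        1.0 if free_servers_local else 0.0,  # locality flag
    ]

row = node_features(num_tasks=50, avg_task_duration=2.5,
                    servers_assigned=3, free_servers_local=True)
print(row)  # [50.0, 2.5, 3.0, 1.0]
```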

Page 20


Graph Neural Network

[Figure: the graph neural network takes a job DAG and outputs a score on each node.]
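The per-node scoring can be sketched as message passing along the DAG. The toy below (plain Python; Decima's real network uses learned non-linear transformations before and after aggregation, which this sketch omits) propagates information from a node's descendants up to the node, then reduces each embedding to a scalar score:

```python
# Toy message passing on a DAG: each node's embedding is its own feature
# vector plus the sum of its children's embeddings (memoized recursion).
def embed(children, features, node, memo):
    if node in memo:
        return memo[node]
    agg = [0.0] * len(features[node])
    for c in children.get(node, []):
        for i, v in enumerate(embed(children, features, c, memo)):
            agg[i] += v
    memo[node] = [a + b for a, b in zip(features[node], agg)]
    return memo[node]

# Hypothetical chain DAG: 0 -> 1 -> 2 with 1-dim features.
children = {0: [1], 1: [2]}
feats = {0: [1.0], 1: [2.0], 2: [3.0]}
memo = {}
# Score = sum of embedding entries; node 0 aggregates the whole chain.
scores = {v: sum(embed(children, feats, v, memo)) for v in feats}
print(scores)
```

This aggregation is what makes the network scalable: one set of weights handles DAGs of any size and shape.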

Page 21

Training

[Figure: the Decima agent interacts with a simulated cluster, generating experience data for reinforcement learning training.]

Page 22

Handle Online Job Arrival

The RL agent has to experience continuous job arrivals during training.
→ Simply feeding long sequences of jobs is inefficient.

[Figure: number of backlogged jobs over time under the initial random policy.]

Page 23

[Figure: under the initial random policy the backlog keeps growing, so most of the episode wastes training time.]

Page 24

Early reset for initial training: terminate episodes early while the policy is still weak.

[Figure: number of backlogged jobs over time; the episode is reset before the backlog explodes.]

Page 25

Curriculum learning: as training proceeds, increase the reset time; the stronger policy keeps the queue stable.

[Figure: number of backlogged jobs over time remains bounded as episodes grow longer.]
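The curriculum can be sketched as a schedule that lengthens the episode horizon over training iterations (the linear schedule and all constants below are illustrative assumptions, not the paper's actual values):

```python
# Hedged sketch of the curriculum idea: short episodes (early reset) at the
# start, growing reset time as the policy improves, capped at a maximum.
def episode_horizon(iteration, start=500, growth=50, max_horizon=20_000):
    """Return the episode length (simulated seconds) for a training iteration."""
    return min(start + growth * iteration, max_horizon)

print(episode_horizon(0))    # 500: early reset for the random initial policy
print(episode_horizon(100))  # 5500: longer episodes once the queue stays stable
```

Early resets keep the random initial policy from spending whole episodes on a hopelessly backlogged queue, while the growing horizon eventually exposes the agent to realistic long-run arrival dynamics.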

Page 26


Variance from Job Sequences

The RL agent needs to be robust to variation in job arrival patterns.

→ Huge variance can throw off the training process.

Page 27

Variance from Job Sequences

[Figure: job size over time; after action a_t at time t, two different future workloads (#1 and #2) may arrive.]

Score for action a_t = (return after a_t) − (average return) = ∑_{t′=t}^{T} r_{t′} − b(s_t)

Must consider the entire job sequence to score actions.

Page 28

Input-Dependent Baseline

Score for action a_t = ∑_{t′=t}^{T} r_{t′} − b(s_t)
Score for action a_t = ∑_{t′=t}^{T} r_{t′} − b(s_t, z_t, z_{t+1}, …)

where b(s_t, z_t, z_{t+1}, …) is the average return over trajectories from state s_t with job sequence z_t, z_{t+1}, …

Broadly applicable to other systems with an external input process: adaptive video streaming, load balancing, caching, robotics with disturbances, …

• Variance reduction for reinforcement learning in input-driven environments. Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. International Conference on Learning Representations (ICLR), 2019.
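The baseline above can be sketched as a simple group-by-average (the function and the toy trajectory data are my own illustration): instead of one average return per state, it averages returns only over trajectories that experienced the same job arrival sequence z, so the advantage no longer reflects variance caused by the external input.

```python
# Sketch of an input-dependent baseline: average returns per job sequence,
# then score each trajectory against the baseline for ITS sequence.
from collections import defaultdict

def input_dependent_baseline(trajectories):
    """trajectories: list of (job_sequence_id, return). Return mean per sequence."""
    sums, counts = defaultdict(float), defaultdict(int)
    for z, ret in trajectories:
        sums[z] += ret
        counts[z] += 1
    return {z: sums[z] / counts[z] for z in sums}

# Hypothetical returns: seq_b is a much heavier workload than seq_a.
trajs = [("seq_a", 10.0), ("seq_a", 14.0), ("seq_b", 100.0), ("seq_b", 96.0)]
baselines = input_dependent_baseline(trajs)
advantages = [ret - baselines[z] for z, ret in trajs]
print(advantages)  # workload-scale variance is removed from the scores
```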

Page 29

Decima vs. Baselines: Batched Arrivals

• 20 TPC-H queries sampled at random; input sizes: 2, 5, 10, 20, 50, and 100 GB
• Decima trained in a simulator, tested on a real Spark cluster

Decima improves average job completion time by 21% to 3.1× over baseline schemes.

Page 30

Decima with Continuous Job Arrivals

1,000 jobs arrive as a Poisson process with an average inter-arrival time of 25 sec.

Decima achieves 28% lower average JCT than the best heuristic, and 2× better JCT under overload.
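The arrival process in this experiment can be sketched directly (the function name and seed are my own; a Poisson process has exponentially distributed inter-arrival times):

```python
# Sketch of the experiment's workload: 1,000 job arrival times drawn from a
# Poisson process with mean inter-arrival time 25 s.
import random

def poisson_arrival_times(n_jobs, mean_interarrival=25.0, seed=0):
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(1.0 / mean_interarrival)  # exponential gap
        times.append(t)
    return times

times = poisson_arrival_times(1000)
print(times[-1] / len(times))  # empirical mean inter-arrival, roughly 25 s
```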

Page 31

Understanding Decima

[Figure: per-job behavior of tuned weighted fair scheduling vs. Decima.]

Page 32

Flexibility: Multi-Resource Scheduling

• Industrial trace (Alibaba): 20,000 jobs from a production cluster
• Multi-resource requirements: CPU cores + memory units

Page 33

Other Evaluations

• Impact of each component in the learning algorithm
• Generalization to different workloads
• Training and inference speed
• Handling missing features
• Optimality gap

Page 34

Summary

• Decima uses reinforcement learning to generate workload-specific scheduling algorithms

• Decima employs curriculum learning and variance reduction to enable training with stochastic job arrivals

• Decima leverages a scalable graph neural network to process an arbitrary number of job DAGs

• Decima outperforms existing heuristics and is flexible enough to apply to other applications

http://web.mit.edu/decima/