ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters
Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar


Page 1: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

Page 2: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Context

• GPUs are used in supercomputers
  – Some of the top500 supercomputers use GPUs
    • Tianhe-1A: 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 GPUs
    • Stampede: about 6,000 nodes (Xeon E5-2680 8C, Intel Xeon Phi)
• GPUs are used in cloud computing

Need for resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs

Page 3: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Categories of Scheduling Objectives

• Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency
• A market-based service world is emerging: focus on the provider’s profit and the user’s satisfaction
  – Cloud: pay-as-you-go model
    • Amazon: different user classes (On-Demand, Free, Spot, …)
  – Recent resource managers for supercomputers (e.g. MOAB) have the notion of a service-level agreement (SLA)

Page 4: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Motivation

State of the Art
• Open-source batch schedulers are starting to support GPUs
  – TORQUE, SLURM
  – Users guide the mapping of jobs to heterogeneous nodes
  – Simple scheduling schemes (goals: throughput & latency)
• Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
  – [gViM HPCVirt ’09] [vCUDA IPDPS ’09] [rCUDA HPCS ’10] [gVirtuS Euro-Par 2010] [our HPDC ’11, CCGRID ’12, HPDC ’12]
  – Simple scheduling schemes (goals: throughput & latency)
• Proposals on market-based scheduling policies focus on homogeneous CPU clusters
  – [Irwin HPDC ’04] [Sherwani Soft. Pract. Exp. ’04]

Our Goal: Reconsider market-based scheduling for heterogeneous clusters including GPUs

Page 5: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Considerations

• The community is looking into code portability between CPU and GPU
  – OpenCL
  – PGI CUDA-x86
  – MCUDA (CUDA-C), Ocelot, SWAN (CUDA-to-OpenCL), OpenMPC
  → Opportunity to flexibly schedule a job on the CPU or the GPU
• In cloud environments, oversubscription is commonly used to reduce infrastructure costs
  → Use resource sharing to improve performance by maximizing hardware utilization

Page 6: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Problem Formulation

• Given a CPU-GPU cluster
• Schedule a set of jobs on the cluster
  – To maximize the provider’s profit / aggregate user satisfaction
• Exploit the portability offered by OpenCL
  – Flexibly map each job onto either the CPU or the GPU
• Maximize resource utilization
  – Allow sharing of the multi-core CPU or the GPU

Assumptions/Limitations
• 1 multi-core CPU and 1 GPU per node
• Single-node, single-GPU jobs
• Only space-sharing, limited to two jobs per resource

Page 7: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Value Function

Market-based Scheduling Formulation

• For each job, a linear-decay value function [Irwin HPDC’04]
• Max Value → importance/priority of the job; Decay → urgency of the job
• Delay due to: queuing, execution on a non-optimal resource, resource sharing

[Figure: yield/value vs. execution time – the value starts at Max Value and decreases linearly at the decay rate beyond time T]

Yield = maxValue – decay * delay
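
To make the formula concrete, here is a minimal Python sketch (not the authors' code) of the linear-decay value function; the Job fields and the yield_of helper are illustrative assumptions.

```python
# Minimal sketch of the linear-decay value function; names are illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    max_value: float   # importance/priority of the job
    decay: float       # value lost per unit of delay (urgency)

def yield_of(job: Job, delay: float) -> float:
    """Yield earned by the provider: maxValue minus the value decayed
    during queuing, non-optimal execution, and sharing delays."""
    return job.max_value - job.decay * delay

# Example: a job worth 100 that loses 2 units of value per second of delay,
# delayed by 30 seconds, yields 100 - 2*30 = 40.
print(yield_of(Job(max_value=100.0, decay=2.0), delay=30.0))
```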

Page 8: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall Scheduling Approach

Scheduling Flow

• Jobs arrive in batches
• Phase 1: Mapping
  – Jobs are enqueued into the CPU queue or the GPU queue, i.e. on their optimal resource
  – Phase 1 is oblivious of other jobs (based on optimal walltime)
• Phase 2: Sorting
  – Sort jobs to improve yield (inter-job scheduling considerations)
• Phase 3: Re-mapping
  – Different schemes: when to remap? what to remap?
• Jobs then execute on the CPU or on the GPU

Page 9: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Phase 1: Mapping

• Users provide walltimes for both CPU and GPU
  – Walltime is used as the indicator of the optimal/non-optimal resource
  – Each job is mapped onto its optimal resource
• NOTE: in our experiments we assumed maxValue = optimal walltime
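
A minimal sketch of Phase 1, assuming (as an illustration only) that each job carries user-provided cpu_walltime and gpu_walltime fields:

```python
# Minimal sketch of Phase 1: map each job to the queue of its optimal
# resource, i.e. the resource with the smaller user-provided walltime.
# Attribute and queue names are illustrative assumptions.
def map_to_optimal_queue(job, cpu_queue, gpu_queue):
    if job.cpu_walltime <= job.gpu_walltime:
        cpu_queue.append(job)   # CPU is this job's optimal resource
    else:
        gpu_queue.append(job)   # GPU is this job's optimal resource
```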

Page 10: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Phase 2: Sorting

• Sort jobs based on Reward [Irwin HPDC’04]
• Present Value – f(maxValue_i, discount_rate)
  – Value after discounting the risk of running a job
  – The shorter the job, the lower the risk
• Opportunity Cost
  – Degradation in value due to the selection of one among several alternatives

Reward_i = (PresentValue_i – OpportunityCost_i) / Walltime_i

OpportunityCost_i = Walltime_i * Σ_{j≠i} decay_j
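
As a hedged illustration of the sorting criterion, the sketch below computes the reward from the formulas above; the exact discounting used for PresentValue is an assumption (the slide only states PresentValue = f(maxValue_i, discount_rate)), and all names are illustrative.

```python
# Sketch of the Phase 2 reward; the present-value discounting form is assumed.
def present_value(job, discount_rate: float) -> float:
    # Discount the job's max value by the risk of holding a resource for its
    # walltime: the shorter the job, the smaller the discount.
    return job.max_value / (1.0 + discount_rate * job.walltime)

def opportunity_cost(job, pending_jobs) -> float:
    # Value lost by all other pending jobs while this job occupies the resource.
    return job.walltime * sum(j.decay for j in pending_jobs if j is not job)

def reward(job, pending_jobs, discount_rate: float) -> float:
    return (present_value(job, discount_rate)
            - opportunity_cost(job, pending_jobs)) / job.walltime

# Phase 2 then sorts a pending queue by decreasing reward, e.g.:
# queue.sort(key=lambda j: reward(j, queue, discount_rate), reverse=True)
```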

Page 11: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Phase 3: Remapping

• When to remap:
  – Uncoordinated schemes: a queue is empty and its resource is idle
  – Coordinated scheme: the CPU and GPU queues are imbalanced
• What to remap:
  – Which job will have the best reward on its non-optimal resource?
  – Which job will suffer the least reward penalty?

Page 12: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Phase 3: Uncoordinated Schemes

1. Last Optimal Reward (LOR)
   – Remap the job with the least reward on its optimal resource
   – Idea: least reward → least risk in moving
2. First Non-Optimal Reward (FNOR)
   – Compute the reward each job could produce on its non-optimal resource
   – Remap the job with the highest reward on the non-optimal resource
   – Idea: consider the non-optimal penalty
3. Last Non-Optimal Reward Penalty (LNORP)
   – Remap the job with the least reward degradation
   – RewardDegradation_i = OptimalReward_i – NonOptimalReward_i
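
A minimal sketch of the three selection rules, assuming a hypothetical reward_on(job, resource) helper derived from the Phase 2 formulas:

```python
# Sketches of the uncoordinated remapping choices; reward_on is hypothetical.
def pick_lor(queue, optimal, reward_on):
    # LOR: job with the least reward on its optimal resource (least risk to move).
    return min(queue, key=lambda j: reward_on(j, optimal))

def pick_fnor(queue, non_optimal, reward_on):
    # FNOR: job with the highest reward on the non-optimal resource.
    return max(queue, key=lambda j: reward_on(j, non_optimal))

def pick_lnorp(queue, optimal, non_optimal, reward_on):
    # LNORP: job with the least reward degradation when moved.
    return min(queue, key=lambda j: reward_on(j, optimal) - reward_on(j, non_optimal))
```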

Page 13: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Phase 3: Coordinated Scheme

Coordinated Least Penalty (CORLP)
• When to remap: imbalance between the queues
  – Imbalance is affected by the decay rates and execution times of jobs
  – Total Queuing-Delay Decay-Rate Product (TQDP):
    TQDP = Σ_j queuing_delay_j * decay_j  (summed over the jobs pending in a queue)
  – Remap if |TQDP_CPU – TQDP_GPU| > threshold
• What to remap:
  – Remap the job with the least penalty degradation
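
A minimal sketch of the CORLP trigger based on the TQDP definition above; the threshold value and attribute names are assumptions:

```python
# Sketch of the coordinated (CORLP) remapping trigger.
def tqdp(queue):
    # Total Queuing-Delay Decay-Rate Product of one pending queue.
    return sum(job.queuing_delay * job.decay for job in queue)

def should_remap(cpu_queue, gpu_queue, threshold: float) -> bool:
    # Remap when the two queues are imbalanced in accumulated value loss.
    return abs(tqdp(cpu_queue) - tqdp(gpu_queue)) > threshold
```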

Page 14: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Heuristic for Sharing

• Limitation: two jobs can space-share a CPU/GPU
• Factors affecting sharing
  – Slowdown incurred by jobs using half of a resource
  + More resources made available for other jobs
• Jobs
  – Categorized as low, medium, or high scaling (based on models/profiling)
• When to enable sharing
  – A large fraction of jobs in the pending queues have negative yield
• What jobs share a resource (see the sketch below)
  – Scalability-DecayRate factor
    • Jobs are grouped based on scalability
    • Within each group, jobs are ordered by decay rate (urgency)
  – Pick the top K fraction of jobs, where K is tunable (low scalability, low decay first)
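
A minimal sketch of the Scalability-DecayRate selection just described; the scaling classes and attribute names are illustrative assumptions:

```python
# Sketch of the sharing heuristic: group by scalability class, order by decay
# rate within each group, and pick the top K fraction (low scalability and
# low decay first), as described on the slide.
SCALING_ORDER = {"low": 0, "medium": 1, "high": 2}

def pick_sharing_candidates(pending_jobs, k: float):
    ranked = sorted(
        pending_jobs,
        key=lambda j: (SCALING_ORDER[j.scaling_class], j.decay),
    )
    count = int(len(ranked) * k)   # K is the tunable fraction of jobs to share
    return ranked[:count]
```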

Page 15: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall System Prototype

[Figure: a Master Node connected to several Compute Nodes]

Page 16: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall System Prototype

[Figure: the Master Node hosts the Cluster-Level Scheduler – the Scheduling Schemes & Policies, a TCP Communicator, and Submission, Pending, Execution, and Finished Queues (separate CPU and GPU queues). Each Compute Node contains a multi-core CPU and a GPU.]

Page 17: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall System Prototype

[Figure: as on the previous slide, with a Node-Level Runtime added on each Compute Node between the Cluster-Level Scheduler and the multi-core CPU / GPU]

Page 18: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall System Prototype

[Figure: the Node-Level Runtime is expanded – it contains a TCP Communicator, CPU Execution Processes (relying on OS-based scheduling & sharing), GPU Execution Processes, and a GPU Consolidation Framework]

Page 19: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Overall System Prototype

[Figure: the same architecture, annotated – the Master Node provides centralized decision making, the Node-Level Runtimes provide the execution & sharing mechanisms, and a shared file system is assumed]

Page 20: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

GPU Sharing Framework: GPU-related Node-Level Runtime

[Figure: CUDA app1 … CUDA appN – the GPU execution processes (Front-Ends) – each link against a CUDA Interception Library and talk to a Back-End through a Front End – Back End communication channel; the Back-End runs on top of the CUDA Runtime, the CUDA Driver, and the GPU, and constitutes the GPU Consolidation Framework]

Page 21: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

GPU Sharing Framework: GPU-related Node-Level Runtime

[Figure: inside the GPU Consolidation Framework, a Back-End Server receives the CUDA calls arriving from the Front-Ends, a Virtual Context maps them onto CUDA streams (stream1 … streamN), and a Workload Consolidator manipulates kernel configurations to allow GPU space sharing]

Simplified version of our HPDC’11 runtime

Page 22: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Experimental Setup

• 16-node cluster
  – CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
  – GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
• 256-job workload
  – 10 benchmark programs
  – 3 configurations: small, large, very large datasets
  – Various application domains: scientific computations, financial analysis, data mining, machine learning
• Baselines
  – TORQUE (always optimal resource)
  – Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Page 23: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Comparison with TORQUE-based Metrics: Throughput & Latency

[Figure: completion time and average latency (Comp. Time-UM, Comp. Time-BM, Ave. Lat-UM, Ave. Lat-BM), normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP – completion time is 10-20% better and average latency ~20% better with our schemes]

• The baselines suffer from idle resources
• By privileging shorter jobs, our schemes reduce queuing delays

Page 24: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Results with the Average Yield Metric: Effect of Job Mix

[Figure: relative average yield vs. CPU/GPU job mix ratio (25C/75G skewed-GPU, 50C/50G uniform, 75C/25G skewed-CPU) for TORQUE, MCT, FNOR, LNORP, LOR, and CORLP – up to 8.8x and up to 2.3x better than the baselines, depending on the mix]

• Better on skewed job mixes:
  – More idle time in the case of the baseline schemes
  – More room for dynamic mapping

Page 25: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Results with the Average Yield Metric: Effect of the Value Function

[Figure: relative average yield for linear-decay and step-decay value functions, for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP – up to 3.8x and up to 6.9x better than the baselines]

• Our schemes adapt to different value functions

Page 26: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Results with the Average Yield Metric: Effect of System Load

[Figure: relative average yield vs. total number of jobs (128, 256, 384, 512) for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP – up to 8.2x better than the baselines]

• As the load increases, the yield of the baselines decreases linearly
• The proposed schemes achieve an initially increasing and then sustained yield

Page 27: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Yield Improvements from Sharing

[Figure: yield improvement (%) vs. the sharing K factor (0.1-0.6, the fraction of jobs allowed to share), showing the CPU-only, GPU-only, and combined CPU & GPU sharing benefits – up to 23% improvement]

• Careful space sharing can help performance by freeing resources
• Excessive sharing can be detrimental to performance

Page 28: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Summary

• Value-based scheduling on CPU-GPU clusters
  – Goal: improve the aggregate yield
• Coordinated and uncoordinated scheduling schemes for dynamic mapping
• Automatic space sharing of resources based on heuristics
• Prototype framework for evaluating the proposed schemes
• Improvement over the state of the art
  – Based on completion time & latency
  – Based on average yield