Page 1: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Wei Jiang (The Ohio State University)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Research Laboratories)

1

Page 2: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Rise of Heterogeneous Architectures

• Today’s High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent “price-performance” and “performance-per-watt”
• Flavors of heterogeneous computing
  – Multi-core CPUs + (GPUs/MICs) connected over PCI-E
  – Integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
• Such heterogeneous platforms exist in:
  – 3 out of 5 top supercomputers, and large clusters in academia and industry
  – Many cloud providers: Amazon, Nimbix, SoftLayer …

2

Page 3: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Motivation

• Supercomputers and cloud environments are typically “shared”
  – Accelerate a set of applications as opposed to a single application
• Software stack to program CPU-GPU architectures
  – Combination of (Pthreads/OpenMP …) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write a kernel once, execute on any device
• Today’s schedulers (like TORQUE) for heterogeneous clusters:
  – DO NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to resources
  – Do not consider desirable scheduling possibilities (using CPU+GPU)

3

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations

Page 4: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

4

Page 5: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

5

Page 6: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Problem Formulations

Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling

6

Page 7: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Formulations

Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
  – Limited mechanisms to exploit CPU+GPU simultaneously
• Exploit the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in the acceleration of applications
• In addition, allows multiple-node allocation per job
• MATE-CG [IPDPS’12], a framework for the Map-Reduce class of apps., allows such implementations

7

Page 8: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

8

Page 9: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Challenges and Solution Approach

Decision Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves penalty)?
• Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times …)
• Design scheduling schemes for each scheduling formulation

9

Page 10: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

10

Page 11: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Schemes for First Formulation

11

Two Input Categories & Three Schemes (categories are based on the amount of input expected from the user):

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
  Scheme 1: Relative Speedup based w/ Aggressive Option (RSA)
  Scheme 2: Relative Speedup based w/ Conservative Option (RSC)

Category 2: Additionally, sequential CPU execution time (SQ)
  Scheme 3: Adaptive Shortest Job First (ASJF)

Page 12: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Relative-Speedup Aggressive (RSA) or Conservative (RSC)

12

Takes multi-core and GPU speedups as input; creates CPU/GPU queues and maps each job to its optimal resource queue. The aggressive option minimizes penalty.

• Input: N jobs, MP[n], GP[n]
• Create CJQ and GJQ; enqueue jobs in the queues by (GP − MP)
• Sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the top of GJQ to R
• If GJQ is empty:
  – Aggressive: assign the bottom of CJQ to R (minimizes penalty)
  – Conservative: wait for a CPU
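The queue construction and assignment decision above can be sketched in Python (a minimal sketch: the job records with `MP`/`GP` speedup fields and the tie-breaking on equal speedups are assumptions, not the authors’ implementation):

```python
from collections import deque

def build_queues(jobs):
    """Split jobs into a CPU queue (CJQ) and a GPU queue (GJQ) by their
    optimal resource, sorted by preference strength (GP - MP) descending,
    so the bottom of each queue is the job penalized least elsewhere."""
    gjq = deque(sorted((j for j in jobs if j["GP"] > j["MP"]),
                       key=lambda j: j["GP"] - j["MP"], reverse=True))
    cjq = deque(sorted((j for j in jobs if j["GP"] <= j["MP"]),
                       key=lambda j: j["MP"] - j["GP"], reverse=True))
    return cjq, gjq

def next_job(resource_is_gpu, cjq, gjq, aggressive):
    """Pick a job for a freed resource. Aggressive takes the bottom of
    the other queue (least non-optimal penalty) rather than idling;
    conservative returns None, i.e. waits for the optimal resource."""
    own, other = (gjq, cjq) if resource_is_gpu else (cjq, gjq)
    if own:
        return own.popleft()
    if aggressive and other:
        return other.pop()
    return None
```

A freed GPU thus drains GJQ first; only when GJQ is empty does the RSA/RSC choice matter.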

Page 13: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Adaptive Shortest Job First (ASJF)

13

• Input: N jobs, MP[n], GP[n], SQ[n]
• Create CJQ and GJQ; enqueue jobs in the queues by (GP − MP)
• Sort CJQ and GJQ in ascending order of SQ (minimizes latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the top of GJQ to R
• If GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = GetJobWithMinPenOnGPU(CJQ)
  – If T1 > T2k: assign CJQk to R
  – Else: wait for a CPU to become free or for GPU jobs
• Automatically switches between the aggressive and conservative options
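The GPU branch of this decision can be sketched as follows (a hedged sketch: the penalty estimate SQ/GP for running a CPU-queue job on the GPU is an assumed model, and the queues are assumed already sorted ascending by SQ as the slide describes):

```python
def asjf_pick_for_gpu(cjq, gjq, min_wait_for_next_cpu):
    """ASJF decision for a freed GPU: serve GJQ shortest-job-first;
    otherwise steal from CJQ only when stealing beats waiting."""
    if gjq:
        return gjq.pop(0)                  # optimal: shortest GPU job
    if not cjq:
        return None
    # T2k: estimated run time on the GPU of the CPU-queue job with the
    # least penalty there (assumed model: SQ / GP).
    k = min(range(len(cjq)), key=lambda i: cjq[i]["SQ"] / cjq[i]["GP"])
    t2k = cjq[k]["SQ"] / cjq[k]["GP"]
    if min_wait_for_next_cpu > t2k:        # aggressive switch pays off
        return cjq.pop(k)
    return None                            # conservative: wait instead
```

The T1 > T2k comparison is what makes the switch between the aggressive and conservative behaviors automatic.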

Page 14: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

14

Page 15: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Scheme for Second Formulation

15

Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating ½ or ¼ of the requested nodes

Inputs from User:
• Execution times of the CPU-only, GPU-only, and CPU+GPU versions
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles

Page 16: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Flexible Moldable Scheduling Scheme (FMS)

16

• Input: N jobs, execution times …
• Group jobs with the # of nodes as the index
  – Minimizes resource fragmentation; helps co-locate a CPU job and a GPU job on the same node
• Sort each group based on the execution time of the CPU+GPU version
  – Gives a global view for co-locating jobs on the same node
• Pick a pair of jobs to schedule in order of sorting
• Find the fastest completion option from T(i,n,C), T(i,n,G), T(i,n,CG) for each job
  – If C for one job & G for the other: co-locate the jobs on the same set of nodes
  – If the same resource for both jobs, (C,C), (G,G), or (CG,CG):
    – If 2N nodes are available: schedule the pair of jobs in parallel on 2N nodes
    – Else: schedule the first job on N nodes and consider molding the # of nodes for the next job
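The grouping, sorting, and pairing decisions above can be sketched as follows (a minimal sketch; the field names `nodes` and `T_C`/`T_G`/`T_CG` for per-resource execution times are illustrative assumptions):

```python
def fms_groups(jobs):
    """Group jobs by requested node count (reduces fragmentation), then
    sort each group by CPU+GPU execution time for a global view."""
    groups = {}
    for j in jobs:
        groups.setdefault(j["nodes"], []).append(j)
    for g in groups.values():
        g.sort(key=lambda j: j["T_CG"])
    return groups

def fastest_option(job):
    """Fastest completion option among CPU-only, GPU-only, CPU+GPU."""
    return min(("C", "G", "CG"), key=lambda r: job["T_" + r])

def pair_decision(j1, j2, free_nodes):
    """Placement for a sorted pair, each requesting j1['nodes'] nodes."""
    n = j1["nodes"]
    o1, o2 = fastest_option(j1), fastest_option(j2)
    if {o1, o2} == {"C", "G"}:
        return "co-locate"      # CPU job + GPU job share the same n nodes
    if free_nodes >= 2 * n:
        return "parallel"       # same resource type, 2n nodes in parallel
    return "mold-second"        # run j1 on n nodes; mold j2's node count
```

Co-location is the key case: a CPU-optimal and a GPU-optimal job occupy the same nodes without contending for the same resource.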

Page 17: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

17

Page 18: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Cluster Hardware Setup

18

• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an NVIDIA Tesla C2050 (1.15 GHz)
• CPU main memory: 48 GB
• GPU device memory: 3 GB
• Machines are connected through InfiniBand

Page 19: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks

19

Single-Node Jobs
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Run each benchmark with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
  – Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs

Page 20: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Baselines & Metrics

20

Baseline for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search; an upper bound)

Baseline for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Metrics
• Completion Time (Comp. Time)
• Application Latency:
  – Non-optimal Assignment (Ave. NOA Lat.)
  – Queuing Delay (Ave. QD Lat.)
• Maximum Idle Time (Max. Idle Time)

Page 21: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Single-Node Job Results

21

[Charts: Comp. Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time, normalized over the best case, for BRR, RSA, RSC, ASJF, and Manual Optimal, under a Uniform CPU-GPU Job Mix and a CPU-biased Job Mix]

• 24 jobs on 2 nodes; proposed schemes evaluated on 4 different metrics
• Proposed schemes are 108% better than BRR and within 12% of Manual Optimal
• Tradeoff between non-optimal penalty and wait time for a resource
• BRR has the highest latency
  – RSA incurs non-optimal-assignment penalty; RSC incurs high queuing delay
• ASJF is as good as Manual Optimal
• BRR has very high idle times; RSC’s can be very high too
• RSA has the best utilization among the proposed schemes

Page 22: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Multi-Node Job Results

22

[Charts: normalized completion time for Torque, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS), under varying job execution lengths (75 SJ/25 LJ, 50 SJ/50 LJ, 25 SJ/75 LJ; Short Job (SJ), Long Job (LJ)) and varying resource request sizes (75 SR/25 LR, 50 SR/50 LR, 25 SR/75 LR; Small Request (SR), Large Request (LR))]

• 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of Torque or MCT
  – Each type of molding gives a reasonable improvement
  – Our schemes utilize the resources better → higher throughput
• Varying request sizes: FMS is 32% better than the best of Torque or MCT
  – Intelligent in deciding whether to wait for a resource or mold the job for smaller resources
  – The benefit from ResType molding is greater than from NumNodes molding

Page 23: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

23

Page 24: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Conclusions

24

• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatic mapping of jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state of the art

Page 26: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Large Dataset

26

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 7.3   | 4.7  | 6.8 | 14336*14336
Image Processing   | 33.8  | 5.1  | 7.8 | 14336*14336
FDTD               | 8.4   | 2.2  | 7.6 | 14336*14336
BlackScholes       | 2.6   | 2.1  | 7.2 | 10 mil options
Binomial Options   | 11.8  | 5.6  | 4.2 | 1024 options
MonteCarlo         | 45.4  | 38.4 | 7.9 | 1024 options
Kmeans             | 330.0 | 12.1 | 7.8 | 1.6*10^9 points
KNN                | 67.3  | 7.8  | 6.2 | 67108864 points
PCA                | 142.0 | 9.7  | 5.6 | 262144*80
Molecular Dynamics | 46.6  | 12.9 | 7.9 | 256000 nodes, 31744000 edges

Page 27: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Small Dataset

27

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 1.8   | 3.8  | 7.1 | 7168*7168
Image Processing   | 8.4   | 5.6  | 7.5 | 7168*7168
FDTD               | 2.1   | 1.3  | 7.7 | 7168*7168
BlackScholes       | 0.7   | 0.6  | 6.8 | 2.5 mil options
Binomial Options   | 3.0   | 2.3  | 4.2 | 128 options
MonteCarlo         | 11.0  | 9.4  | 7.9 | 256 options
Kmeans             | 74.2  | 6.3  | 7.7 | 0.4*10^9 points
KNN                | 16.8  | 2.9  | 6.2 | 16777216 points
PCA                | 33.8  | 9.1  | 5.6 | 65536*80
Molecular Dynamics | 6.7   | 12.8 | 7.3 | 32000 nodes, 3968000 edges

Page 28: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Large No. of Iterations

28

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 722.1  | 4.3   | 8.1 | 14336*14336
Image Processing   | 3385.5 | 4.8   | 8.0 | 14336*14336
FDTD               | 423.3  | 1.8   | 7.9 | 14336*14336
BlackScholes       | 269.1  | 92.8  | 7.8 | 10 mil options
Binomial Options   | 1213.6 | 12.2  | 4.3 | 1024 options
MonteCarlo         | 453.3  | 368.5 | 7.8 | 1024 options
Kmeans             | 1593.8 | 12.6  | 7.9 | 1.6*10^9 points
KNN                | 1691.1 | 58.4  | 6.9 | 67108864 points
PCA                | 2835.7 | 11.8  | 6.2 | 262144*80
Molecular Dynamics | 593.8  | 20.8  | 7.8 | 256000 nodes, 31744000 edges