Page 1: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

Vignesh Ravi (The Ohio State University)
Michela Becchi (University of Missouri)
Wei Jiang (The Ohio State University)
Gagan Agrawal (The Ohio State University)
Srimat Chakradhar (NEC Research Laboratories)

1

Page 2: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Rise of Heterogeneous Architectures

• Today’s High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent “price-performance” and “performance-per-watt”
• Flavors of heterogeneous computing
  – Multi-core CPUs + (GPUs/MICs) connected over PCI-E
  – Integrated CPU-GPUs like AMD Fusion, Intel Sandy Bridge
• Such heterogeneous platforms exist in:
  – 3 out of 5 top supercomputers, and large clusters in academia and industry
  – Many cloud providers: Amazon, Nimbix, SoftLayer …

2

Page 3: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Motivation

• Supercomputers and cloud environments are typically “shared”
  – Accelerate a set of applications as opposed to a single application
• Software stack to program CPU-GPU architectures
  – Combination of (Pthreads/OpenMP …) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write a kernel once, execute on any device
• Today’s schedulers (like TORQUE) for heterogeneous clusters:
  – DO NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to resources
  – Do not consider desirable scheduling possibilities (using CPU+GPU)

3

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations

Page 4: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

4

Page 5: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

5

Page 6: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Problem Formulations

Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling

6

Page 7: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Formulations

Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmarks like Rodinia (UVA) & Parboil (UIUC) contain single-node apps.
  – Limited mechanisms to exploit CPU+GPU simultaneously
• Exploit the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in the acceleration of applications
• In addition, allows multiple-node allocation per job
• MATE-CG [IPDPS’12], a framework for the Map-Reduce class of apps., allows such implementations

7

Page 8: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

8

Page 9: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Challenges and Solution Approach

Decision Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves penalty)?
• Always allocating CPU+GPU may affect global throughput
  – Should consider other possibilities like CPU-only or GPU-only
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times …)
• Design scheduling schemes for each scheduling formulation

9

Page 10: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

10

Page 11: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Schemes for First Formulation

11

Two Input Categories & Three Schemes (categories are based on the amount of input expected from the user):

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
  Scheme 1: Relative Speedup based w/ Aggressive Option (RSA)
  Scheme 2: Relative Speedup based w/ Conservative Option (RSC)

Category 2: Additionally, sequential CPU execution time (SQ)
  Scheme 3: Adaptive Shortest Job First (ASJF)

Page 12: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Relative-Speedup Aggressive (RSA) or Conservative (RSC)

12

Takes multi-core and GPU speedups as input; creates CPU/GPU queues and maps each job to its optimal resource queue. The aggressive option minimizes penalty.

• Input: N jobs, MP[n], GP[n]
• Create CJQ and GJQ; enqueue jobs in the queues by (GP − MP)
• Sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the top of GJQ to R
• If GJQ is empty:
  – Aggressive: assign the bottom of CJQ to R (minimizes penalty)
  – Conservative: wait for a CPU
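The queue construction and assignment decision above can be sketched in Python (a minimal sketch: the job records with `MP`/`GP` speedup fields and the tie-breaking on equal speedups are assumptions, not the authors’ implementation):

```python
from collections import deque

def build_queues(jobs):
    """Split jobs into a CPU queue (CJQ) and a GPU queue (GJQ) by their
    optimal resource, sorted by preference strength (GP - MP) descending,
    so the bottom of each queue is the job penalized least elsewhere."""
    gjq = deque(sorted((j for j in jobs if j["GP"] > j["MP"]),
                       key=lambda j: j["GP"] - j["MP"], reverse=True))
    cjq = deque(sorted((j for j in jobs if j["GP"] <= j["MP"]),
                       key=lambda j: j["MP"] - j["GP"], reverse=True))
    return cjq, gjq

def next_job(resource_is_gpu, cjq, gjq, aggressive):
    """Pick a job for a freed resource. Aggressive takes the bottom of
    the other queue (least non-optimal penalty) rather than idling;
    conservative returns None, i.e. waits for the optimal resource."""
    own, other = (gjq, cjq) if resource_is_gpu else (cjq, gjq)
    if own:
        return own.popleft()
    if aggressive and other:
        return other.pop()
    return None
```

A freed GPU thus drains GJQ first; only when GJQ is empty does the RSA/RSC choice matter.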

Page 13: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Adaptive Shortest Job First (ASJF)

13

• Input: N jobs, MP[n], GP[n], SQ[n]
• Create CJQ and GJQ; enqueue jobs in the queues by (GP − MP)
• Sort CJQ and GJQ in ascending order of SQ (minimizes latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the top of GJQ to R
• If GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = GetJobWithMinPenOnGPU(CJQ)
  – If T1 > T2k: assign CJQk to R
  – Else: wait for a CPU to become free or for GPU jobs
• Automatically switches between the aggressive and conservative options
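The GPU branch of this decision can be sketched as follows (a hedged sketch: the penalty estimate SQ/GP for running a CPU-queue job on the GPU is an assumed model, and the queues are assumed already sorted ascending by SQ as the slide describes):

```python
def asjf_pick_for_gpu(cjq, gjq, min_wait_for_next_cpu):
    """ASJF decision for a freed GPU: serve GJQ shortest-job-first;
    otherwise steal from CJQ only when stealing beats waiting."""
    if gjq:
        return gjq.pop(0)                  # optimal: shortest GPU job
    if not cjq:
        return None
    # T2k: estimated run time on the GPU of the CPU-queue job with the
    # least penalty there (assumed model: SQ / GP).
    k = min(range(len(cjq)), key=lambda i: cjq[i]["SQ"] / cjq[i]["GP"])
    t2k = cjq[k]["SQ"] / cjq[k]["GP"]
    if min_wait_for_next_cpu > t2k:        # aggressive switch pays off
        return cjq.pop(k)
    return None                            # conservative: wait instead
```

The T1 > T2k comparison is what makes the switch between the aggressive and conservative behaviors automatic.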

Page 14: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

14

Page 15: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Scheduling Scheme for Second Formulation

15

Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating ½ or ¼ of the requested nodes

Inputs from User:
• Execution times of the CPU-only, GPU-only, and CPU+GPU versions
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles

Page 16: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Flexible Moldable Scheduling Scheme (FMS)

16

• Input: N jobs, execution times …
• Group jobs with the # of nodes as the index
  – Minimizes resource fragmentation; helps co-locate a CPU job and a GPU job on the same node
• Sort each group based on the execution time of the CPU+GPU version
  – Gives a global view for co-locating jobs on the same node
• Pick a pair of jobs to schedule in order of sorting
• Find the fastest completion option from T(i,n,C), T(i,n,G), T(i,n,CG) for each job
  – If C for one job & G for the other: co-locate the jobs on the same set of nodes
  – If the same resource for both jobs, (C,C), (G,G), or (CG,CG):
    – If 2N nodes are available: schedule the pair of jobs in parallel on 2N nodes
    – Else: schedule the first job on N nodes and consider molding the # of nodes for the next job
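The grouping, sorting, and pairing decisions above can be sketched as follows (a minimal sketch; the field names `nodes` and `T_C`/`T_G`/`T_CG` for per-resource execution times are illustrative assumptions):

```python
def fms_groups(jobs):
    """Group jobs by requested node count (reduces fragmentation), then
    sort each group by CPU+GPU execution time for a global view."""
    groups = {}
    for j in jobs:
        groups.setdefault(j["nodes"], []).append(j)
    for g in groups.values():
        g.sort(key=lambda j: j["T_CG"])
    return groups

def fastest_option(job):
    """Fastest completion option among CPU-only, GPU-only, CPU+GPU."""
    return min(("C", "G", "CG"), key=lambda r: job["T_" + r])

def pair_decision(j1, j2, free_nodes):
    """Placement for a sorted pair, each requesting j1['nodes'] nodes."""
    n = j1["nodes"]
    o1, o2 = fastest_option(j1), fastest_option(j2)
    if {o1, o2} == {"C", "G"}:
        return "co-locate"      # CPU job + GPU job share the same n nodes
    if free_nodes >= 2 * n:
        return "parallel"       # same resource type, 2n nodes in parallel
    return "mold-second"        # run j1 on n nodes; mold j2's node count
```

Co-location is the key case: a CPU-optimal and a GPU-optimal job occupy the same nodes without contending for the same resource.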

Page 17: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

17

Page 18: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Cluster Hardware Setup

18

• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an NVIDIA Tesla C2050 (1.15 GHz)
• CPU main memory: 48 GB
• GPU device memory: 3 GB
• Machines are connected through InfiniBand

Page 19: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks

19

Single-Node Jobs
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Run each benchmark with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
  – Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs

Page 20: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Baselines & Metrics

20

Baseline for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search; an upper bound)

Baseline for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Metrics
• Completion Time (Comp. Time)
• Application Latency:
  – Non-optimal Assignment (Ave. NOA Lat.)
  – Queuing Delay (Ave. QD Lat.)
• Maximum Idle Time (Max. Idle Time)

Page 21: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Single-Node Job Results

21

[Charts: Comp. Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time, normalized over the best case, for BRR, RSA, RSC, ASJF, and Manual Optimal, under a Uniform CPU-GPU Job Mix and a CPU-biased Job Mix]

• 24 jobs on 2 nodes; proposed schemes evaluated on 4 different metrics
• Proposed schemes are 108% better than BRR and within 12% of Manual Optimal
• Tradeoff between non-optimal penalty and wait time for a resource
• BRR has the highest latency
  – RSA incurs non-optimal-assignment penalty; RSC incurs high queuing delay
• ASJF is as good as Manual Optimal
• BRR has very high idle times; RSC’s can be very high too
• RSA has the best utilization among the proposed schemes

Page 22: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Multi-Node Job Results

22

[Charts: normalized completion time for Torque, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS), under varying job execution lengths (75 SJ/25 LJ, 50 SJ/50 LJ, 25 SJ/75 LJ; Short Job (SJ), Long Job (LJ)) and varying resource request sizes (75 SR/25 LR, 50 SR/50 LR, 25 SR/75 LR; Small Request (SR), Large Request (LR))]

• 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of Torque or MCT
  – Each type of molding gives a reasonable improvement
  – Our schemes utilize the resources better → higher throughput
• Varying request sizes: FMS is 32% better than the best of Torque or MCT
  – Intelligent in deciding whether to wait for a resource or mold the job for smaller resources
  – The benefit from ResType molding is greater than from NumNodes molding

Page 23: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Outline

• Problem Formulation
• Challenges and Solution Approach
• Scheduling of Single-Node, Single-Resource Jobs
• Scheduling of Multi-Node, Multi-Resource Jobs
• Experimental Results
• Conclusions

23

Page 24: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Conclusions

24

• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatic mapping of jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state of the art

Page 26: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Large Dataset

26

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 7.3   | 4.7  | 6.8 | 14336*14336
Image Processing   | 33.8  | 5.1  | 7.8 | 14336*14336
FDTD               | 8.4   | 2.2  | 7.6 | 14336*14336
BlackScholes       | 2.6   | 2.1  | 7.2 | 10 mil options
Binomial Options   | 11.8  | 5.6  | 4.2 | 1024 options
MonteCarlo         | 45.4  | 38.4 | 7.9 | 1024 options
Kmeans             | 330.0 | 12.1 | 7.8 | 1.6*10^9 points
KNN                | 67.3  | 7.8  | 6.2 | 67108864 points
PCA                | 142.0 | 9.7  | 5.6 | 262144*80
Molecular Dynamics | 46.6  | 12.9 | 7.9 | 256000 nodes, 31744000 edges

Page 27: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Small Dataset

27

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 1.8   | 3.8  | 7.1 | 7168*7168
Image Processing   | 8.4   | 5.6  | 7.5 | 7168*7168
FDTD               | 2.1   | 1.3  | 7.7 | 7168*7168
BlackScholes       | 0.7   | 0.6  | 6.8 | 2.5 mil options
Binomial Options   | 3.0   | 2.3  | 4.2 | 128 options
MonteCarlo         | 11.0  | 9.4  | 7.9 | 256 options
Kmeans             | 74.2  | 6.3  | 7.7 | 0.4*10^9 points
KNN                | 16.8  | 2.9  | 6.2 | 16777216 points
PCA                | 33.8  | 9.1  | 5.6 | 65536*80
Molecular Dynamics | 6.7   | 12.8 | 7.3 | 32000 nodes, 3968000 edges

Page 28: Scheduling Concurrent Applications on a  Cluster of CPU-GPU Nodes

Benchmarks – Large No. of Iterations

28

Benchmark          | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver         | 722.1  | 4.3   | 8.1 | 14336*14336
Image Processing   | 3385.5 | 4.8   | 8.0 | 14336*14336
FDTD               | 423.3  | 1.8   | 7.9 | 14336*14336
BlackScholes       | 269.1  | 92.8  | 7.8 | 10 mil options
Binomial Options   | 1213.6 | 12.2  | 4.3 | 1024 options
MonteCarlo         | 453.3  | 368.5 | 7.8 | 1024 options
Kmeans             | 1593.8 | 12.6  | 7.9 | 1.6*10^9 points
KNN                | 1691.1 | 58.4  | 6.9 | 67108864 points
PCA                | 2835.7 | 11.8  | 6.2 | 262144*80
Molecular Dynamics | 593.8  | 20.8  | 7.8 | 256000 nodes, 31744000 edges