SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
Jin Wang*, Norm Rubin†, Albert Sidelnik†, Sudhakar Yalamanchili*
*Computer Architecture and System Lab, Georgia Institute of Technology
†NVIDIA Research
Executive Summary
Motivation: a new memory-reference locality relationship in dynamic parallelism
Problem: state-of-the-art GPU schedulers are unaware of this relationship
- Designed for non-dynamic-parallelism settings
- Do not exploit parent-child locality in the L1 and L2 caches
Proposed: LaPerm, a locality-aware thread block scheduler with three scheduling decisions
Dynamic Parallelism on GPUs
Launch workloads on demand from the GPU:
- CUDA Dynamic Parallelism (CDP): launch new kernels
- OpenCL device-side enqueue: launch new kernels
- Dynamic Thread Block Launch (DTBL) [1]: launch new thread blocks
Benefits:
- Applies to fine-grained parallelism in irregular applications
- Increases execution efficiency and productivity
[1] Wang, Rubin, Sidelnik, and Yalamanchili, "Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs", ISCA 2015.
[Figure: graph-traversal example. Threads t1-t4 of the parent kernel each launch child kernels/TBs on demand once sufficient parallelism exists (e.g. 4 threads). The dynamic launches give uniform control flow, more memory coalescing, and reduced workload imbalance.]
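The launch-on-demand pattern can be modeled with a small Python sketch; the graph, the threshold, and all names below are illustrative stand-ins for the device-side CDP/DTBL launches, not the talk's actual kernels:

```python
from collections import deque

# Hypothetical adjacency list; vertex 0 is a "heavy" vertex with many neighbors.
GRAPH = {0: [1, 2, 3, 4, 5], 1: [2], 2: [], 3: [0, 1, 2, 4], 4: [], 5: []}
LAUNCH_THRESHOLD = 4  # launch a child only when there is sufficient parallelism

def traverse(source):
    """BFS in which heavy vertices spawn 'child launches' (modeling CDP/DTBL)."""
    visited = {source}
    frontier = deque([source])
    child_launches = []            # (parent vertex, work offloaded to a child)
    while frontier:
        v = frontier.popleft()
        neighbors = [n for n in GRAPH[v] if n not in visited]
        if len(neighbors) >= LAUNCH_THRESHOLD:
            child_launches.append((v, len(neighbors)))  # offload to a child TB
        for n in neighbors:        # the child (or the parent itself) visits them
            visited.add(n)
            frontier.append(n)
    return visited, child_launches
```

In this model only vertex 0 exposes enough work to justify a child launch; the light vertices are handled inline by the parent, mirroring the threshold test on the slide.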
Memory Locality in Dynamic Parallelism
Parent-child launching creates a new data-reference locality relationship.
Potential locality:
- Parent-child and child-sibling data sharing
- L1/L2 cache locality
Average shared footprint ratio across 8 benchmarks: parent-child 38.4%, child-sibling 30.5%.
[Figure: the parent kernel launches child kernels/TBs. Parent TBs read/write graph data in global memory through each SMX's L1 cache and the shared L2 cache; child TBs can reuse that graph data.]
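The shared-footprint metric can be illustrated in a few lines of Python; the ratio definition, the 64-byte line size, and the addresses below are illustrative assumptions, not the paper's exact methodology:

```python
def shared_footprint_ratio(parent_lines, child_lines):
    """Fraction of the child's cache-line footprint also touched by the parent."""
    if not child_lines:
        return 0.0
    return len(parent_lines & child_lines) / len(child_lines)

def to_lines(addrs, line_size=64):
    """Map byte addresses to cache-line addresses (64-byte lines assumed)."""
    return {a // line_size for a in addrs}

parent = to_lines({0, 8, 64, 128})    # parent touches lines 0, 1, 2
child = to_lines({0, 64, 256, 320})   # child touches lines 0, 1, 4, 5
ratio = shared_footprint_ratio(parent, child)  # 2 of 4 child lines shared
```

A high ratio means the child's working set largely overlaps the parent's, which is exactly the reuse the L1/L2 caches can capture if the scheduler cooperates.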
Current GPU Scheduler
First-come-first-serve (FCFS) kernel scheduler; round-robin (RR) thread block (TB) scheduler.
[Figure: the first-come-first-serve kernel distributor holds Kernel 1 (TB0-TB3) and Kernel 2 (TB4-TB5); the round-robin SMX scheduler distributes these TBs across the four SMXs in order.]
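The baseline policy can be sketched in Python; this models FCFS kernel ordering plus round-robin TB placement, not actual vendor hardware behavior:

```python
def rr_schedule(kernels, num_smx):
    """FCFS kernel order + round-robin TB-to-SMX assignment (baseline model).

    `kernels` is a list of TB-name lists in arrival (FCFS) order.
    Returns one queue of TB names per SMX.
    """
    smx_queues = [[] for _ in range(num_smx)]
    next_smx = 0
    for kernel in kernels:          # FCFS: kernels drain in arrival order
        for tb in kernel:           # RR: TBs rotate over the SMXs
            smx_queues[next_smx].append(tb)
            next_smx = (next_smx + 1) % num_smx
    return smx_queues

queues = rr_schedule([["TB0", "TB1", "TB2", "TB3"], ["TB4", "TB5"]], 4)
# SMX0 gets TB0 then TB4, SMX1 gets TB1 then TB5, SMX2 gets TB2, SMX3 gets TB3.
```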
Current GPU Scheduler (continued)
- Non-dynamic-parallelism scenario: provides fairness and efficiency
- Dynamic-parallelism scenario: child TBs are scheduled only after the parent TBs
[Figure: with FCFS + round-robin scheduling, all parent TBs (P0-P7) are distributed across the SMXs before any child TBs (C0-C5).]
Current GPU Scheduler: Issues
Issue 1: Child TBs execute long after their parent TBs, decreasing L2 locality.
[Figure: child TBs are scheduled far behind their parent TBs, so the parents' data may be evicted from the shared L2 cache (potential L2 pollution) before the children run.]
Current GPU Scheduler: Issues (continued)
Issue 1: Child TBs execute long after their parent TBs, decreasing L2 locality.
Issue 2: Child TBs execute on a different SMX than their parents, decreasing L1 locality.
- Fails to exploit parent-child or child-sibling locality
- Exacerbated in real applications, which generally have many TBs
[Figure: child TBs run on different SMXs than their parents and therefore cannot utilize the L1 cache on the parent's SMX.]
LaPerm: Locality-Aware Scheduler for Dynamic Parallelism
Leverages the locality between parent and child TBs.
Three scheduling decisions accommodate different forms of locality.
Goal: improve memory efficiency and overall performance.
Scheduling Decision 1: TB Prioritizing
Prioritize child TBs so they execute immediately after their direct parent TBs, reusing parent data and avoiding L2 cache pollution.
[Figure: with TB prioritizing, child TBs (C0-C5) are interleaved immediately after their parent TBs in the schedule, preserving L2 locality; however, children may still land on a different SMX than their parent, so there is still no L1 locality.]
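TB prioritizing can be modeled with a max-priority queue in which a child inherits its parent's priority plus one; the encoding below is a hypothetical Python sketch, not the paper's hardware scheme:

```python
import heapq
import itertools

_counter = itertools.count()   # FIFO tie-break among equal priorities
queue = []                     # max-priority heap (priorities stored negated)

def submit(tb_name, priority):
    heapq.heappush(queue, (-priority, next(_counter), tb_name))

def submit_child(tb_name, parent_priority):
    submit(tb_name, parent_priority + 1)   # children outrank their parent

def dispatch():
    return heapq.heappop(queue)[2]

for p in ["P0", "P1"]:
    submit(p, priority=0)
submit_child("C0", parent_priority=0)      # launched by P0 at run time
order = [dispatch() for _ in range(3)]     # C0 dispatches ahead of pending parents
```

Because every child carries a higher priority than any waiting parent, newly launched children jump ahead of the remaining parent TBs, which is what keeps the parent's data warm in L2.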
Scheduling Decision 2: Prioritized SMX Binding
Bind the prioritized child TBs to the same SMX as their parent TBs, utilizing the L1 cache for data reuse.
[Figure: prioritized child TBs are bound to the same SMX as their parents, so they can reuse data in the parent's L1 cache; SMXs whose parents launch many children become overloaded, creating an SMX load-balancing issue.]
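Strict binding can be sketched as follows; the queue/dictionary structures and names are illustrative, and the example deliberately reproduces the load-imbalance problem shown in the figure:

```python
# Sketch of prioritized SMX binding: each child TB is queued on the SMX that
# ran its parent so it can hit in that SMX's L1 cache.
NUM_SMX = 4
smx_queues = [[] for _ in range(NUM_SMX)]
tb_location = {}   # TB name -> index of the SMX it was assigned to

def place_parent(tb, smx):
    smx_queues[smx].append(tb)
    tb_location[tb] = smx

def bind_child(child, parent):
    smx = tb_location[parent]     # strict binding: always the parent's SMX
    smx_queues[smx].append(child)
    tb_location[child] = smx

place_parent("P0", 0)
place_parent("P1", 1)
for c in ["C0", "C1", "C2"]:
    bind_child(c, "P0")           # all of P0's children pile onto SMX 0
# SMX 0 now holds four TBs while SMX 2 and SMX 3 sit idle: load imbalance.
```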
Scheduling Decision 3: Adaptive Prioritized SMX Binding
Adaptively bind child TBs to available SMXs, avoiding the SMX load imbalance caused by strict binding.
[Figure: adaptive binding places a child TB on its parent's SMX when that SMX has capacity, and otherwise falls back to a less-loaded SMX, retaining most L1 reuse while keeping the SMXs balanced.]
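The adaptive policy extends the strict-binding sketch with a fallback; the capacity threshold and the least-loaded tie-break are illustrative parameters, not the paper's exact heuristic:

```python
# Sketch of adaptive prioritized SMX binding: prefer the parent's SMX for L1
# reuse, but fall back to the least-loaded SMX when the parent's SMX is busy.
NUM_SMX = 4
CAPACITY = 2                       # max queued TBs before an SMX counts as busy
smx_queues = [[] for _ in range(NUM_SMX)]
tb_location = {}

def place(tb, smx):
    smx_queues[smx].append(tb)
    tb_location[tb] = smx

def bind_child_adaptive(child, parent):
    preferred = tb_location[parent]
    if len(smx_queues[preferred]) < CAPACITY:
        place(child, preferred)    # parent's SMX has room: keep L1 locality
    else:                          # otherwise pick the least-loaded SMX
        place(child, min(range(NUM_SMX), key=lambda i: len(smx_queues[i])))

place("P0", 0)
for c in ["C0", "C1", "C2"]:
    bind_child_adaptive(c, "P0")
# C0 stays on SMX 0 with its parent; C1 and C2 spill to the idle SMXs.
```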
Architecture Support
Multi-level priority queues:
- Store new TB/kernel information
- Ordered by priority value
- Divided among the multiple SMXs
[Figure: the new scheduler logic adds on-chip priority queues, with off-chip overflow queues in DRAM, between the kernel management unit, the first-come-first-serve kernel distributor, and the SMX scheduler.]
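A bounded on-chip queue with off-chip overflow can be sketched as below; the spill policy (evict the lowest-priority entry) and all names are illustrative assumptions, not the paper's hardware design:

```python
import heapq

class OverflowPriorityQueue:
    """On-chip priority queue with off-chip (DRAM) overflow: a sketch."""
    def __init__(self, on_chip_capacity):
        self.capacity = on_chip_capacity
        self.on_chip = []    # min-heap of (-priority, name): max-priority first
        self.overflow = []   # models the unbounded off-chip overflow queue

    def push(self, name, priority):
        heapq.heappush(self.on_chip, (-priority, name))
        if len(self.on_chip) > self.capacity:
            worst = max(self.on_chip)          # lowest-priority on-chip entry
            self.on_chip.remove(worst)
            heapq.heapify(self.on_chip)
            heapq.heappush(self.overflow, worst)   # spill it to DRAM

    def pop(self):
        # Serve from overflow only when it holds the highest-priority entry.
        if self.overflow and (not self.on_chip or self.overflow[0] < self.on_chip[0]):
            return heapq.heappop(self.overflow)[1]
        return heapq.heappop(self.on_chip)[1]

q = OverflowPriorityQueue(on_chip_capacity=2)
q.push("K0", 1)
q.push("K1", 3)
q.push("K2", 2)   # on-chip is full, so lowest-priority K0 spills off-chip
order = [q.pop() for _ in range(3)]   # served highest priority first
```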
Benchmark and Experimental Environment
Benchmarks implemented with dynamic parallelism and simulated on GPGPU-Sim.
Benchmark                    | Input                           | Notation
Adaptive Mesh Refinement     | Energy data set                 | amr
Barnes Hut Tree              | Random data set                 | bht
Breadth-First Search         | Citation network                | bfs_citation
                             | USA road network                | bfs_usa_road
                             | Cage15 sparse matrix            | bfs_cage15
Graph Coloring               | Citation network                | clr_citation
                             | Graph 500                       | clr_graph500
                             | Cage15 sparse matrix            | clr_cage15
Regular Expression Match     | DARPA network                   | regx_darpa
                             | Random string collection        | regx_string
Product Recommendation       | MovieLens                       | pre
Relational Join              | Uniform-distribution synthetic  | join_uniform
                             | Gaussian-distribution synthetic | join_gaussian
Single-Source Shortest Path  | Citation network                | sssp_citation
                             | Flight network                  | sssp_flight
                             | Cage15 sparse matrix            | sssp_cage15

Application domains: tree and graph applications, machine learning, relational database, physics simulation.
Performance of LaPerm: L2 Cache
- L2 cache hit rate benefits mainly from prioritizing child TBs
- Graph applications with the cage15 input have more parent-child data reuse
[Chart: L2 cache hit rate (%) per benchmark for RR, TB-Pri, SMX-Bind, and Adaptive-Bind.]
Performance of LaPerm: L1 Cache
- L1 cache hit rate benefits mainly from binding child TBs to their parents' SMXs
- Product recommendation (pre) has good child-sibling data locality
[Chart: L1 cache hit rate (%) per benchmark for RR, TB-Pri, SMX-Bind, and Adaptive-Bind.]
Performance of LaPerm: IPC
- TB-Pri: IPC increases (1.13x) due to the higher L2 cache hit rate
- SMX-Bind: IPC (1.08x) drops relative to TB-Pri; the higher L1 cache hit rate is offset by SMX load imbalance
- Adaptive-Bind: overall IPC increases (1.27x) thanks to both memory efficiency and load balancing
[Chart: IPC per benchmark for TB-Pri, SMX-Bind, and Adaptive-Bind, normalized to RR.]
Conclusion
LaPerm: a locality-aware thread block scheduler for dynamic parallelism
- Exploits the new memory-reference locality in parent-child launching
- Three scheduling decisions increase L1/L2 cache locality while maintaining SMX load balance
- Achieves overall memory-system efficiency and performance improvement
THANK YOU!
QUESTIONS?