SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
Jin Wang*, Norm Rubin†, Albert Sidelnik†, Sudhakar Yalamanchili*
*Computer Architecture and System Lab, Georgia Institute of Technology
†NVIDIA Research
Executive Summary
Motivation: a new memory-reference locality relationship in dynamic parallelism
Problem: state-of-the-art GPU schedulers are unaware of this relationship
- Designed for non-dynamic-parallelism settings
- Do not exploit parent-child locality in the L1 and L2 caches
Proposed: LaPerm, a locality-aware thread block scheduler with three scheduling decisions
Dynamic Parallelism on GPUs
Launch workloads on demand from the GPU:
- CUDA Dynamic Parallelism (CDP): launch new kernels
- OpenCL device-side enqueue: launch new kernels
- Dynamic Thread Block Launch (DTBL) [1]: launch new thread blocks
Benefits:
- Applies to fine-grained parallelism in irregular applications
- Increases execution efficiency and productivity
[1] Wang, Rubin, Sidelnik, and Yalamanchili, "Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs", ISCA 2015.
[Figure: graph-traversal example. Threads t1-t4 of the parent kernel each launch child kernels/TBs on demand once sufficient parallelism exists (e.g. 4 threads). The dynamic launches give uniform control flow, more memory coalescing, and reduced workload imbalance.]
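The launch-on-demand pattern can be modeled with a small Python sketch; the graph, the threshold, and all names below are illustrative stand-ins for the device-side CDP/DTBL launches, not the talk's actual kernels:

```python
from collections import deque

# Hypothetical adjacency list; vertex 0 is a "heavy" vertex with many neighbors.
GRAPH = {0: [1, 2, 3, 4, 5], 1: [2], 2: [], 3: [0, 1, 2, 4], 4: [], 5: []}
LAUNCH_THRESHOLD = 4  # launch a child only when there is sufficient parallelism

def traverse(source):
    """BFS in which heavy vertices spawn 'child launches' (modeling CDP/DTBL)."""
    visited = {source}
    frontier = deque([source])
    child_launches = []            # (parent vertex, work offloaded to a child)
    while frontier:
        v = frontier.popleft()
        neighbors = [n for n in GRAPH[v] if n not in visited]
        if len(neighbors) >= LAUNCH_THRESHOLD:
            child_launches.append((v, len(neighbors)))  # offload to a child TB
        for n in neighbors:        # the child (or the parent itself) visits them
            visited.add(n)
            frontier.append(n)
    return visited, child_launches
```

In this model only vertex 0 exposes enough work to justify a child launch; the light vertices are handled inline by the parent, mirroring the threshold test on the slide.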
Memory Locality in Dynamic Parallelism
Parent-child launching creates a new data-reference locality relationship.
Potential locality:
- Parent-child and child-sibling data sharing
- L1/L2 cache locality
Average shared footprint ratio across 8 benchmarks: parent-child 38.4%, child-sibling 30.5%.
[Figure: the parent kernel launches child kernels/TBs. Parent TBs read/write graph data in global memory through each SMX's L1 cache and the shared L2 cache; child TBs can reuse that graph data.]
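The shared-footprint metric can be illustrated in a few lines of Python; the ratio definition, the 64-byte line size, and the addresses below are illustrative assumptions, not the paper's exact methodology:

```python
def shared_footprint_ratio(parent_lines, child_lines):
    """Fraction of the child's cache-line footprint also touched by the parent."""
    if not child_lines:
        return 0.0
    return len(parent_lines & child_lines) / len(child_lines)

def to_lines(addrs, line_size=64):
    """Map byte addresses to cache-line addresses (64-byte lines assumed)."""
    return {a // line_size for a in addrs}

parent = to_lines({0, 8, 64, 128})    # parent touches lines 0, 1, 2
child = to_lines({0, 64, 256, 320})   # child touches lines 0, 1, 4, 5
ratio = shared_footprint_ratio(parent, child)  # 2 of 4 child lines shared
```

A high ratio means the child's working set largely overlaps the parent's, which is exactly the reuse the L1/L2 caches can capture if the scheduler cooperates.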
Current GPU Scheduler
First-come-first-serve (FCFS) kernel scheduler; round-robin (RR) thread block (TB) scheduler.
[Figure: the first-come-first-serve kernel distributor holds Kernel 1 (TB0-TB3) and Kernel 2 (TB4-TB5); the round-robin SMX scheduler distributes these TBs across the four SMXs in order.]
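The baseline policy can be sketched in Python; this models FCFS kernel ordering plus round-robin TB placement, not actual vendor hardware behavior:

```python
def rr_schedule(kernels, num_smx):
    """FCFS kernel order + round-robin TB-to-SMX assignment (baseline model).

    `kernels` is a list of TB-name lists in arrival (FCFS) order.
    Returns one queue of TB names per SMX.
    """
    smx_queues = [[] for _ in range(num_smx)]
    next_smx = 0
    for kernel in kernels:          # FCFS: kernels drain in arrival order
        for tb in kernel:           # RR: TBs rotate over the SMXs
            smx_queues[next_smx].append(tb)
            next_smx = (next_smx + 1) % num_smx
    return smx_queues

queues = rr_schedule([["TB0", "TB1", "TB2", "TB3"], ["TB4", "TB5"]], 4)
# SMX0 gets TB0 then TB4, SMX1 gets TB1 then TB5, SMX2 gets TB2, SMX3 gets TB3.
```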
Current GPU Scheduler (continued)
- Non-dynamic-parallelism scenario: provides fairness and efficiency
- Dynamic-parallelism scenario: child TBs are scheduled only after the parent TBs
[Figure: with FCFS + round-robin scheduling, all parent TBs (P0-P7) are distributed across the SMXs before any child TBs (C0-C5).]
Current GPU Scheduler: Issues
Issue 1: Child TBs execute long after their parent TBs, decreasing L2 locality.
[Figure: child TBs are scheduled far behind their parent TBs, so the parents' data may be evicted from the shared L2 cache (potential L2 pollution) before the children run.]
Current GPU Scheduler: Issues (continued)
Issue 1: Child TBs execute long after their parent TBs, decreasing L2 locality.
Issue 2: Child TBs execute on a different SMX than their parents, decreasing L1 locality.
- Fails to exploit parent-child or child-sibling locality
- Exacerbated in real applications, which generally have many TBs
[Figure: child TBs run on different SMXs than their parents and therefore cannot utilize the L1 cache on the parent's SMX.]
LaPerm: Locality-Aware Scheduler for Dynamic Parallelism
Leverages the locality between parent and child TBs.
Three scheduling decisions accommodate different forms of locality.
Goal: improve memory efficiency and overall performance.
Scheduling Decision 1: TB Prioritizing
Prioritize child TBs so they execute immediately after their direct parent TBs, reusing parent data and avoiding L2 cache pollution.
[Figure: with TB prioritizing, child TBs (C0-C5) are interleaved immediately after their parent TBs in the schedule, preserving L2 locality; however, children may still land on a different SMX than their parent, so there is still no L1 locality.]
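TB prioritizing can be modeled with a max-priority queue in which a child inherits its parent's priority plus one; the encoding below is a hypothetical Python sketch, not the paper's hardware scheme:

```python
import heapq
import itertools

_counter = itertools.count()   # FIFO tie-break among equal priorities
queue = []                     # max-priority heap (priorities stored negated)

def submit(tb_name, priority):
    heapq.heappush(queue, (-priority, next(_counter), tb_name))

def submit_child(tb_name, parent_priority):
    submit(tb_name, parent_priority + 1)   # children outrank their parent

def dispatch():
    return heapq.heappop(queue)[2]

for p in ["P0", "P1"]:
    submit(p, priority=0)
submit_child("C0", parent_priority=0)      # launched by P0 at run time
order = [dispatch() for _ in range(3)]     # C0 dispatches ahead of pending parents
```

Because every child carries a higher priority than any waiting parent, newly launched children jump ahead of the remaining parent TBs, which is what keeps the parent's data warm in L2.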
Scheduling Decision 2: Prioritized SMX Binding
Bind the prioritized child TBs to the same SMX as their parent TBs, utilizing the L1 cache for data reuse.
[Figure: prioritized child TBs are bound to the same SMX as their parents, so they can reuse data in the parent's L1 cache; SMXs whose parents launch many children become overloaded, creating an SMX load-balancing issue.]
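Strict binding can be sketched as follows; the queue/dictionary structures and names are illustrative, and the example deliberately reproduces the load-imbalance problem shown in the figure:

```python
# Sketch of prioritized SMX binding: each child TB is queued on the SMX that
# ran its parent so it can hit in that SMX's L1 cache.
NUM_SMX = 4
smx_queues = [[] for _ in range(NUM_SMX)]
tb_location = {}   # TB name -> index of the SMX it was assigned to

def place_parent(tb, smx):
    smx_queues[smx].append(tb)
    tb_location[tb] = smx

def bind_child(child, parent):
    smx = tb_location[parent]     # strict binding: always the parent's SMX
    smx_queues[smx].append(child)
    tb_location[child] = smx

place_parent("P0", 0)
place_parent("P1", 1)
for c in ["C0", "C1", "C2"]:
    bind_child(c, "P0")           # all of P0's children pile onto SMX 0
# SMX 0 now holds four TBs while SMX 2 and SMX 3 sit idle: load imbalance.
```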
Scheduling Decision 3: Adaptive Prioritized SMX Binding
Adaptively bind child TBs to available SMXs, avoiding the SMX load imbalance caused by strict binding.
[Figure: adaptive binding places a child TB on its parent's SMX when that SMX has capacity, and otherwise falls back to a less-loaded SMX, retaining most L1 reuse while keeping the SMXs balanced.]
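The adaptive policy extends the strict-binding sketch with a fallback; the capacity threshold and the least-loaded tie-break are illustrative parameters, not the paper's exact heuristic:

```python
# Sketch of adaptive prioritized SMX binding: prefer the parent's SMX for L1
# reuse, but fall back to the least-loaded SMX when the parent's SMX is busy.
NUM_SMX = 4
CAPACITY = 2                       # max queued TBs before an SMX counts as busy
smx_queues = [[] for _ in range(NUM_SMX)]
tb_location = {}

def place(tb, smx):
    smx_queues[smx].append(tb)
    tb_location[tb] = smx

def bind_child_adaptive(child, parent):
    preferred = tb_location[parent]
    if len(smx_queues[preferred]) < CAPACITY:
        place(child, preferred)    # parent's SMX has room: keep L1 locality
    else:                          # otherwise pick the least-loaded SMX
        place(child, min(range(NUM_SMX), key=lambda i: len(smx_queues[i])))

place("P0", 0)
for c in ["C0", "C1", "C2"]:
    bind_child_adaptive(c, "P0")
# C0 stays on SMX 0 with its parent; C1 and C2 spill to the idle SMXs.
```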
Architecture Support
Multi-level priority queues:
- Store new TB/kernel information
- Ordered by priority value
- Divided among the multiple SMXs
[Figure: the new scheduler logic adds on-chip priority queues, with off-chip overflow queues in DRAM, between the kernel management unit, the first-come-first-serve kernel distributor, and the SMX scheduler.]
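A bounded on-chip queue with off-chip overflow can be sketched as below; the spill policy (evict the lowest-priority entry) and all names are illustrative assumptions, not the paper's hardware design:

```python
import heapq

class OverflowPriorityQueue:
    """On-chip priority queue with off-chip (DRAM) overflow: a sketch."""
    def __init__(self, on_chip_capacity):
        self.capacity = on_chip_capacity
        self.on_chip = []    # min-heap of (-priority, name): max-priority first
        self.overflow = []   # models the unbounded off-chip overflow queue

    def push(self, name, priority):
        heapq.heappush(self.on_chip, (-priority, name))
        if len(self.on_chip) > self.capacity:
            worst = max(self.on_chip)          # lowest-priority on-chip entry
            self.on_chip.remove(worst)
            heapq.heapify(self.on_chip)
            heapq.heappush(self.overflow, worst)   # spill it to DRAM

    def pop(self):
        # Serve from overflow only when it holds the highest-priority entry.
        if self.overflow and (not self.on_chip or self.overflow[0] < self.on_chip[0]):
            return heapq.heappop(self.overflow)[1]
        return heapq.heappop(self.on_chip)[1]

q = OverflowPriorityQueue(on_chip_capacity=2)
q.push("K0", 1)
q.push("K1", 3)
q.push("K2", 2)   # on-chip is full, so lowest-priority K0 spills off-chip
order = [q.pop() for _ in range(3)]   # served highest priority first
```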
Benchmark and Experimental Environment
Benchmarks implemented with dynamic parallelism and simulated on GPGPU-Sim.
Benchmark                    | Input                           | Notation
Adaptive Mesh Refinement     | Energy data set                 | amr
Barnes Hut Tree              | Random data set                 | bht
Breadth-First Search         | Citation network                | bfs_citation
                             | USA road network                | bfs_usa_road
                             | Cage15 sparse matrix            | bfs_cage15
Graph Coloring               | Citation network                | clr_citation
                             | Graph 500                       | clr_graph500
                             | Cage15 sparse matrix            | clr_cage15
Regular Expression Match     | DARPA network                   | regx_darpa
                             | Random string collection        | regx_string
Product Recommendation       | MovieLens                       | pre
Relational Join              | Uniform-distribution synthetic  | join_uniform
                             | Gaussian-distribution synthetic | join_gaussian
Single-Source Shortest Path  | Citation network                | sssp_citation
                             | Flight network                  | sssp_flight
                             | Cage15 sparse matrix            | sssp_cage15

Application domains: tree and graph applications, machine learning, relational database, physics simulation.
Performance of LaPerm: L2 Cache
- L2 cache hit rate benefits mainly from prioritizing child TBs
- Graph applications with the cage15 input have more parent-child data reuse
[Chart: L2 cache hit rate (%) per benchmark for RR, TB-Pri, SMX-Bind, and Adaptive-Bind.]
Performance of LaPerm: L1 Cache
- L1 cache hit rate benefits mainly from binding child TBs to their parents' SMXs
- Product recommendation (pre) has good child-sibling data locality
[Chart: L1 cache hit rate (%) per benchmark for RR, TB-Pri, SMX-Bind, and Adaptive-Bind.]
Performance of LaPerm: IPC
- TB-Pri: IPC increases (1.13x) due to the higher L2 cache hit rate
- SMX-Bind: IPC (1.08x) drops relative to TB-Pri; the higher L1 cache hit rate is offset by SMX load imbalance
- Adaptive-Bind: overall IPC increases (1.27x) thanks to both memory efficiency and load balancing
[Chart: IPC per benchmark for TB-Pri, SMX-Bind, and Adaptive-Bind, normalized to RR.]
Conclusion
LaPerm: a locality-aware thread block scheduler for dynamic parallelism
- Exploits the new memory-reference locality in parent-child launching
- Three scheduling decisions increase L1/L2 cache locality while maintaining SMX load balance
- Achieves overall memory-system efficiency and performance improvement
THANK YOU!
QUESTIONS?