Hadi Jooybar GPUDet: A Deterministic GPU Architecture 1
GPUDet: A Deterministic GPU Architecture
Hadi Jooybar¹, Wilson Fung¹, Mike O’Connor², Joseph Devietti³, Tor M. Aamodt¹
¹The University of British Columbia, ²AMD Research, ³University of Washington
• GPUs are …
• Fast
• Energy efficient
• Commodity hardware
But…
× Mostly used for a limited range of applications
Why?
Communication among 1000s of concurrent threads
Motivation
BFS algorithm, published in HiPC 2007
__global__ void BFS_step_kernel(...) {
    if (active[tid]) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes) {
            if (visited[id] == false) {
                cost[id] = cost[tid] + 1;
                active[id] = true;
                *over = true;
            }
        }
    }
}
[Figure: graph with nodes V0, V1, and V2, shown at three steps. First, two nodes have Cost = - and Active = -. Next, both have Cost = 1 and Active = 1. Finally, one has Cost = 1, Active = 1 and the other has Cost = 2, Active = 1.]
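The race in this kernel can be reproduced with a small sequential model. The sketch below is an illustration under stated assumptions, not code from the talk: it assumes a three-node graph with edges V0→V1, V0→V2, and V1→V2, it snapshots the active set at the start of each step, and it runs the active threads one after another in a chosen order. The names `bfs_step` and `run` are hypothetical.

```python
def bfs_step(order, adj, active, visited, cost):
    """Run one BFS step; active threads execute one at a time in `order`."""
    # Assumption: every thread evaluates `active[tid]` before any write lands.
    frontier = [tid for tid in order if active[tid]]
    for tid in frontier:
        active[tid] = False
        visited[tid] = True
        for nid in adj[tid]:
            if not visited[nid]:          # racy read: depends on thread order
                cost[nid] = cost[tid] + 1
                active[nid] = True


def run(order):
    """Run BFS from V0 on the assumed graph and return the final costs."""
    adj = {0: [1, 2], 1: [2], 2: []}      # assumed edges: V0->V1, V0->V2, V1->V2
    active = [True, False, False]
    visited = [False, False, False]
    cost = [0, 0, 0]
    while any(active):
        bfs_step(order, adj, active, visited, cost)
    return cost


print(run([0, 1, 2]))  # thread V1 runs before V2 marks itself visited
print(run([0, 2, 1]))  # thread V2 runs first
```

Under these assumptions the two orders disagree on the cost of V2, which is the kind of schedule-dependent result the slide illustrates.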
I will debug it this time
What about debuggers?!
The bug may appear occasionally or in different places in each run.
OMG! Where was that bug?!
Motivation
GPUDet
Strong Determinism (hardware proposal)
Same outputs, same execution path
Makes the program easier to debug and test
× There is no free lunch
× Performance overhead
Our goal is to provide Deterministic Execution on GPU architectures with acceptable performance overhead
GPU Architecture
[Figure: GPU block diagram. Labels: CPU (kernel launch); workgroups (workgroup 0, workgroup 1, workgroup 2); compute unit with ALUs, memory unit, and L1 cache; L2 cache; DRAM.]
x = input[threadID];
y = func(x);
output[threadID] = y;
Outline
• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
  • Workgroup-Aware Quantum Formation
  • Deterministic parallel commit using Z-Buffer Unit
  • Compute Unit level serialization
• Results and Conclusion
Deterministic GPU Execution Challenges
[Figure: threads T0 to T3 under normal execution, and under deterministic execution divided into quanta (Quantum 0 … Quantum n), with isolation and communication phases shown for T0 to T3.]
• Isolation mechanism
• Provide method to pause execution of a thread
Deterministic GPU Execution Challenges
• Isolation mechanism
  • Lack of private caches
  • Lack of cache coherency
• Provide method to pause execution of a thread
  • Single Instruction Multiple Threads (SIMT)
  • Potential deadlock condition
  • Major changes in control flow hardware
  • Performance overhead
[Figure labels: workgroup n, wavefront]
Deterministic GPU Execution Challenges
• Very large number of threads
  • Expensive global synchronization
  • Expensive serialization
• Different program properties
  • Large number of short running threads
  • Frequent workgroup synchronization
  • Less locality in intra thread memory accesses
Deterministic Execution of a Wavefront
if (tid < 16) x[tid%2] = tid;
[Figure: threads T0, T1, T2, …, T15 of one wavefront issue x[0] = 0, x[1] = 1, x[0] = 2, …, x[1] = 15 to the coalescing unit (a data race). The coalescing unit sends one request to memory: address x, mask v v - - - - - - … -, data 14 15 - - - - - - … -. Result: x[0] = 14, x[1] = 15, other words not modified.]
Execution of one wavefront is deterministic
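The coalescing behaviour on this slide can be modelled in a few lines. This is a sketch, not the hardware algorithm: it assumes that when several lanes of a wavefront write the same word, the highest lane ID wins, which is consistent with the slide's result (x[0] = 14, x[1] = 15) but is not stated on it. The function names and the wavefront size of 32 are assumptions.

```python
def coalesce_stores(lane_stores):
    """Merge the stores of one wavefront into a single masked write.

    lane_stores[lane] is (index, value) for an active lane, or None.
    Assumed policy: on a conflict, the highest lane ID wins.
    """
    merged = {}
    for lane, store in enumerate(lane_stores):
        if store is not None:
            index, value = store
            merged[index] = value  # a higher lane overwrites a lower one
    return merged


def wavefront_example(wavefront_size=32):
    """Model `if (tid < 16) x[tid % 2] = tid;` for one wavefront."""
    lanes = [(tid % 2, tid) if tid < 16 else None
             for tid in range(wavefront_size)]
    return coalesce_stores(lanes)


print(wavefront_example())
```

Because the merge depends only on lane IDs and not on timing, the same wavefront always produces the same write.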
Deterministic GPU Execution Challenges
• Isolation mechanism
• Provide method to pause execution of a thread
[Figure: isolation and communication phases for T0 to T3. Annotations: "wavefront granularity", "not a challenge anymore".]
Reaching Quantum Boundary
[Figure labels: wavefronts, local memory, store buffers, global memory, read only, load op, commit, atomic op.]
• GPUDet-Basic
  1. Instruction Count
  2. Atomic Operations
  3. Memory Fences
  4. Workgroup Barriers
  5. Execution Complete
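One way to picture the store buffers is as per-wavefront private write logs that are merged only at the quantum boundary. The sketch below is a simplified model under assumptions that are not from the slides: each wavefront is a Python function, loads check the wavefront's own buffer before global memory, and buffers are committed in a fixed order in which the lowest wavefront ID wins. The names `run_quantum` and `increment_by` are hypothetical.

```python
def run_quantum(global_mem, programs, schedule):
    """Run one quantum and return global memory after the commit.

    programs: dict wavefront_id -> function(load, store)
    schedule: the order in which wavefronts happen to execute
    """
    store_buffers = {wid: {} for wid in programs}

    # Parallel mode: global memory is only read; stores stay private.
    for wid in schedule:
        buf = store_buffers[wid]

        def load(addr, buf=buf):
            return buf[addr] if addr in buf else global_mem.get(addr, 0)

        def store(addr, value, buf=buf):
            buf[addr] = value

        programs[wid](load, store)

    # Commit: apply buffers from highest to lowest ID, so that the lowest
    # wavefront ID is written last and wins (assumed policy).
    for wid in sorted(store_buffers, reverse=True):
        global_mem.update(store_buffers[wid])
    return global_mem


def increment_by(amount):
    """Build a wavefront program that does x = x + amount."""
    def program(load, store):
        store("x", load("x") + amount)
    return program


programs = {0: increment_by(1), 1: increment_by(10)}
print(run_quantum({"x": 0}, programs, schedule=[0, 1]))
print(run_quantum({"x": 0}, programs, schedule=[1, 0]))
```

Both schedules leave the same value in `x`, because neither wavefront sees the other's store during the quantum and the commit order does not depend on the schedule.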
Workgroup-Aware Quantum Formation
• Extra global synchronizations
• Load imbalance
Reducing number of synchronizations
Avoid unnecessary quantum termination
Workgroup-Aware Quantum Formation
[Figure: stacked bar chart of quantum termination reasons (% of termination reasons, 0% to 100%) per benchmark: AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, CLopt. Categories: Atomic Operations, Instruction Count, Execution Complete, Workgroup Barriers.]
Workgroup-Aware Decision Making
Quanta are finished by workgroup barriers
All reach a workgroup barrier
Continue execution in the parallel mode
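The decision described on this slide can be written as a small predicate. This is only a sketch of the idea as stated here: when every wavefront of a workgroup has stopped at a workgroup barrier, the workgroup continues in parallel mode instead of ending the quantum. The reason strings and function names are hypothetical, and any further conditions the hardware may check are not modelled.

```python
def basic_action(stop_reasons):
    """GPUDet-Basic, as modelled here: any stop reason ends the quantum."""
    return "end_quantum"


def workgroup_aware_action(stop_reasons):
    """Decide what one workgroup does once all its wavefronts have stopped.

    stop_reasons: one entry per wavefront, for example "barrier",
    "atomic", "instruction_count", "fence" or "complete" (assumed labels).
    """
    if stop_reasons and all(r == "barrier" for r in stop_reasons):
        return "continue_parallel"  # no global synchronization needed
    return "end_quantum"


print(workgroup_aware_action(["barrier", "barrier", "barrier"]))
print(workgroup_aware_action(["barrier", "atomic", "barrier"]))
```

In this model the saving comes from the first case: a barrier that the whole workgroup has reached no longer forces a quantum boundary.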
Workgroup-Aware Quantum Formation
[Figure: stacked bar chart of quantum termination reasons (% of termination reasons, 0% to 100%) per benchmark. Categories: Atomic Operations, Instruction Count, Execution Complete, Workgroup Barriers.]
Workgroup-Aware Decision Making
Finish execution of the Kernel function
Deterministic workgroup partitioning
Deterministic Parallel Commit using the Z-Buffer Unit
Z-Buffer Unit: depth buffer contents in four successive states

State 1:
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞

State 2:
∞ ∞ ∞ ∞ ∞ ∞
∞ ∞ ∞ ∞ ∞ ∞
7 7 7 ∞ ∞ ∞
7 7 7 ∞ ∞ ∞
7 7 7 ∞ ∞ ∞

State 3:
8 8 8 8 8 8
8 8 8 8 8 8
7 7 7 8 8 8
7 7 7 8 8 8
7 7 7 8 8 8

State 4:
8 8 5 5 8 8
8 8 5 5 5 8
7 5 5 5 5 5
7 5 5 5 5 5
5 5 5 5 5 5

Store Buffer Contents ≈ Color Values
Wavefront ID ≈ Depth Values
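The analogy on this slide (wavefront ID as depth, store buffer contents as color) can be modelled as a depth test applied to each written address. The sketch below is a simplified one-dimensional model, not the hardware design: it assumes a "less than" depth test, so the lowest wavefront ID wins each address, as in the depth-buffer states shown on the slide. The function name and data layout are assumptions.

```python
import math


def zbuffer_commit(store_buffers, size):
    """Commit store buffers in arrival order using a depth test.

    store_buffers: list of (wavefront_id, {address: value}) in arrival order
    Returns (depth, color): the winning wavefront ID and value per address.
    """
    depth = [math.inf] * size   # cleared depth buffer
    color = [None] * size       # memory contents being committed
    for wavefront_id, buffer in store_buffers:
        for address, value in buffer.items():
            if wavefront_id < depth[address]:   # depth test
                depth[address] = wavefront_id
                color[address] = value
    return depth, color


w7 = (7, {0: "a", 1: "a"})
w8 = (8, {0: "b", 1: "b", 2: "b"})
w5 = (5, {1: "c", 2: "c"})

print(zbuffer_commit([w7, w8, w5], size=3))
print(zbuffer_commit([w5, w8, w7], size=3))
```

The two arrival orders produce the same final state, which is why the commit can proceed in parallel without a fixed order.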
Compute Unit Level Serialization
• GPUs preserve point-to-point ordering
Serialization is only among compute units
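A rough model shows why serializing per compute unit needs fewer turns than serializing per wavefront. The sketch below rests on assumptions that are not on the slide: each wavefront has one pending atomic add to a shared counter, operations issued by one compute unit arrive in the order that unit issued them (the point-to-point ordering property), and the number of serialized turns stands in for cost. All names are hypothetical.

```python
def run_serial_mode(pending, per_compute_unit):
    """Apply pending atomic adds in a deterministic order.

    pending: list of (compute_unit_id, wavefront_id, amount)
    Returns (final counter, old value seen per wavefront, serialized turns).
    """
    if per_compute_unit:
        # One turn per compute unit; within a unit, issue order is kept.
        order = sorted(pending, key=lambda op: (op[0], op[1]))
        turns = len({op[0] for op in pending})
    else:
        # One turn per wavefront, in wavefront ID order.
        order = sorted(pending, key=lambda op: op[1])
        turns = len(pending)

    counter = 0
    observed = {}
    for _, wavefront_id, amount in order:
        observed[wavefront_id] = counter   # value returned by the atomic
        counter += amount
    return counter, observed, turns


pending = [(1, 3, 8), (0, 1, 2), (1, 2, 4), (0, 0, 1)]
print(run_serial_mode(pending, per_compute_unit=False))
print(run_serial_mode(pending, per_compute_unit=True))
```

Both modes are reproducible regardless of the order in which operations are listed; in this model the compute-unit variant simply reaches the result in fewer turns.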
Results
• GPGPU-Sim 3.0.2
[Figure: normalized execution time (0 to 5), broken down into Serial Mode, Commit Mode, and Parallel Mode, per benchmark: AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, CLopt. Annotations: "2x Slowdown"; "Applications with atomic operations".]
Quantum Formation
20% performance improvement for applications with barriers
19% performance improvement for applications with small kernel functions
[Figure: normalized execution time (0 to 5) per benchmark (AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, CLopt, AVG) for three configurations: GPUDet-base, Workgroup Barrier, End of the Kernel.]
Deterministic Parallel Commit using the Z-Buffer Unit
[Figure: normalized execution time (0 to 10) per benchmark (AES, BFSr, BFSf, CFD, CP, HOTSP, LIB, LPS, SRAD, HT, ATM, CLopt), comparing Z-Buffer and Locking.]
60% Performance Improvement on Average
Compute Unit Level Serialization
[Figure: normalized execution time (0 to 14) in Serial Mode for CLopt, HT, and ATM, comparing W-Ser and CU-Ser.]
6.1x Performance Improvement in Serial Mode
Conclusion
• Encourages programmers to use GPUs in a broader range of applications
• Exploits GPU characteristics to reduce performance overhead
  • Deterministic execution within a wavefront
  • Workgroup-aware quantum formation
  • Deterministic parallel commit using Z-Buffer Unit
  • Compute Unit level serialization
Questions?