Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Heterogeneous System Coherence for Integrated CPU-GPU Systems
Jason Power*, Arkaprava Basu*, Junli Gu†, Sooraj Puthoor†, Bradford M Beckmann†, Mark D Hill*†, Steven K Reinhardt†, David A Wood*†
Heterogeneous System Bottlenecks
0
1
2
3
4
5
bp bfs hs lud nw km sd bn dct hg mm
Dire
ctor
y ac
cess
es
per G
PU c
ycle
Bandwidth
1
10
100
1000
10000
100000
bp bfs hs lud nw km sd bn dct hg mm
Max
imum
MSH
Rs
Resource Usage
0
1
2
3
4
5
bp bfs hs lud nw km sd bn dct hg mm
Slow
-dow
n
Performance with Constrained Resources
Acknowledgements This work was performed at AMD Research, including student internships.
Wisconsin authors improved the paper's presentation while being partially
supported with NSF grants CCF-0916725, SHF-1017650, CNS-1117280, and CCF-
1218323. The views expressed herein are not necessarily those of the NSF.
Professors Hill and Wood have significant financial interests in AMD.
Conclusions Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence
implementations
‒ High BW difficult to support at directory
‒ Extreme resource requirements
We propose Heterogeneous System Coherence ‒ Leverages spatial locality and region coherence
‒ Reduces bandwidth by 94%
‒ Reduces resource requirements by 95%
Come to our talk, Session 7, 11am on Wednesday!
Discussion
0
1
2
3
4
5
bp bfs hs lud nw km sd bn dct hg mm
norm
alize
d lo
ad la
tenc
y
broadcast baseline HSC
0
0.05
0.1
0.15
0.2
0.25
bp bfs hs lud nw km sd bn dct hg mm
Nor
mal
ized
MSH
Rs
requ
ired
HSC Load Latency
HSC Resource Usage
HSC reduces the required MSHRs at the directory
‒ By more than 4x in all cases
‒ By more than 20x in many cases
HSC reduces the average load latency
Methodology CPU Clock 2 GHz CPU Cores 2 CPU Shared L2 Cache 2 MB GPU Clock 1 GHz Compute Units 32 GPU L1 Data Cache 32 KB GPU Shared L2 Cache 4 MB L3 Memory-side Cache 16 MB Peak Memory Bandwidth 700 GB/s Baseline Directory 262,144 entries Region Directory 32,768 entries MSHRs 32 entries Region Buffer 16,384 entries
Results Summary 0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
bp bfs hs lud nw km sd bn dct hg mm
Nor
mal
ized
spee
dup
Broadcast Baseline HSC
HSC Performance
Largest speedup for workloads which constrained
resources hurt the most
Massive bandwidth reduction
‒ Due to offloading data onto direct-access bus
‒ More than theoretical max of 94% in some cases
‒ Region buffers can “prefetch” cache permissions
‒ HSC significantly improves performance over the
baseline design
Decreases bandwidth requirement of directory
Simulation
‒ gem5 for CPU and
memory system
‒ Ruby for caches
‒ GPU modeled off
of GCN
Workloads
‒ Subset of Rodinia
‒ AMD APP SDK
0
0.2
0.4
0.6
0.8
1
1.2
bp bfs hs lud nw km sd bn dct hg mm
Nor
mal
ized
band
wid
th broadcast baseline HSC
Results
HSC Bandwidth
Summary
APU
DRAM Channels
Directory
GPU Cluster
CPU Cluster
L1CPU
Core
L1CPU
Core
L1
L1
CPU
Core
CPU
Core
L2
CU L1
L2
CU L1
CU L1
CU L1
CU L1
CU L1
CU L1
CU L1
L1
L1
L1
L1
L1
L1
L1
L1
CU
CU
CU
CU
CU
CU
CU
CU
Technology drivers
‒ Increasing memory BW
‒ Increasing integration
+ Credit: IBM
System drivers: hUMA
‒ Shared virtual address space
‒ Cache coherence
Key bottlenecks in today’s systems
‒ Directory BW
‒ Resource usage
Heterogeneous System Coherence (HSC)
‒ Leverage coarse-grained sharing between CPU & GPU
‒ Move coherent traffic onto direct-access bus
‒ Bandwidth ↓ 94%, resources ↓ 95%
‒ Average 2x speedup over conventional directory
† *
Baseline Design
Demand
Requests
Cache Tag Arrays
Hit
Miss
Requests
Core Data
Responses
Probe
Requests
Data
Responses
MS
HR
En
trie
s
MSHRs
Coherent
Network
Interface
L2 Cache
Miss
Hit
Miss
Demand
Requests
Cache Tag Arrays
HitCore Data
Responses
Coherent
Network
Interface
Probe
Requests
Region Buffer
Direct Access Bus Interface
Hit
Miss
MS
HR
En
trie
s
MSHRs
L2 Cache
Heterogeneous System Coherence Design
Block Directory Tag Array
PR
En
trie
s
Probe
Request RAM
Coherent
Block Requests
Miss
Hit
Block Probe
Requests/
Responses
MS
HR
En
trie
s
MSHRs
To DRAM
Directory Region Directory Tag Array
Region
Permission
Requests
Miss
Hit
MS
HR
En
trie
sMSHRs
PR
En
trie
s
Probe
Request RAM
Block Probe
Requests/Responses
RegionDirectory
Contact: [email protected] bit.ly/JasonHSC