
Heterogeneous System Coherence for Integrated CPU-GPU Systems

Jason Power*, Arkaprava Basu*, Junli Gu†, Sooraj Puthoor†, Bradford M Beckmann†, Mark D Hill*†, Steven K Reinhardt†, David A Wood*†

Heterogeneous System Bottlenecks

[Figure: three bar charts over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm —
Bandwidth: directory accesses per GPU cycle (0–5);
Resource Usage: maximum MSHRs (log scale, 1–100,000);
Performance with Constrained Resources: slow-down (0–5)]

Acknowledgements

This work was performed at AMD Research, including student internships. Wisconsin authors improved the paper's presentation while being partially supported by NSF grants CCF-0916725, SHF-1017650, CNS-1117280, and CCF-1218323. The views expressed herein are not necessarily those of the NSF. Professors Hill and Wood have significant financial interests in AMD.

Conclusions

Hardware coherence can increase the utility of heterogeneous systems.

Major bottlenecks in current coherence implementations:
‒ High BW difficult to support at directory
‒ Extreme resource requirements

We propose Heterogeneous System Coherence:
‒ Leverages spatial locality and region coherence
‒ Reduces bandwidth by 94%
‒ Reduces resource requirements by 95%

Come to our talk, Session 7, 11am on Wednesday!

Discussion

[Figure: two bar charts over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm —
HSC Load Latency: normalized load latency (0–5) for broadcast, baseline, and HSC;
HSC Resource Usage: normalized MSHRs required by HSC (0–0.25)]

HSC reduces the required MSHRs at the directory:
‒ By more than 4x in all cases
‒ By more than 20x in many cases

HSC reduces the average load latency.

Methodology

CPU Clock                2 GHz
CPU Cores                2
CPU Shared L2 Cache      2 MB
GPU Clock                1 GHz
Compute Units            32
GPU L1 Data Cache        32 KB
GPU Shared L2 Cache      4 MB
L3 Memory-side Cache     16 MB
Peak Memory Bandwidth    700 GB/s
Baseline Directory       262,144 entries
Region Directory         32,768 entries
MSHRs                    32 entries
Region Buffer            16,384 entries
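The entry counts in the table above give a quick sense of the directory-footprint difference; a minimal sketch (per-entry sizes are not stated on the poster, so only entry counts are compared):

```python
# Back-of-the-envelope comparison of directory tracking structures from the
# methodology table: the baseline tracks individual blocks, HSC tracks regions.
BASELINE_DIRECTORY_ENTRIES = 262_144   # block-grained entries
REGION_DIRECTORY_ENTRIES = 32_768      # region-grained entries (HSC)

def entry_reduction(baseline_entries, region_entries):
    """Factor by which HSC shrinks the number of directory entries."""
    return baseline_entries / region_entries

print(entry_reduction(BASELINE_DIRECTORY_ENTRIES, REGION_DIRECTORY_ENTRIES))  # 8.0
```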

Results Summary

[Bar chart: normalized speedup (0–5) over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm for Broadcast, Baseline, and HSC]

HSC Performance

Largest speedup for the workloads that constrained resources hurt the most.

Massive bandwidth reduction:
‒ Due to offloading data onto the direct-access bus
‒ More than the theoretical max of 94% in some cases
‒ Region buffers can "prefetch" cache permissions

HSC significantly improves performance over the baseline design.

Decreases the bandwidth requirement of the directory.
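The 94% theoretical ceiling follows from the region-to-block ratio; a sketch assuming 64 B cache blocks and 1 KB regions (16 blocks per region — these sizes are assumptions for illustration, the poster only states the resulting percentage):

```python
# If only the first access in each region must visit the directory (the region
# permission fetch) and the remaining block requests bypass it over the
# direct-access bus, the directory-traffic reduction ceiling is 1 - 1/N for
# N blocks per region.
BLOCK_BYTES = 64      # assumed cache-block size
REGION_BYTES = 1024   # assumed region size

def max_traffic_reduction(region_bytes, block_bytes):
    """Fraction of block requests that can skip the directory."""
    blocks_per_region = region_bytes // block_bytes
    return 1 - 1 / blocks_per_region

print(f"{max_traffic_reduction(REGION_BYTES, BLOCK_BYTES):.0%}")  # 94%
```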

Simulation
‒ gem5 for CPU and memory system
‒ Ruby for caches
‒ GPU modeled off of GCN

Workloads
‒ Subset of Rodinia
‒ AMD APP SDK

Results

HSC Bandwidth

[Bar chart: normalized bandwidth (0–1.2) over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm for broadcast, baseline, and HSC]

Summary

[Diagram: APU — a CPU cluster (four CPU cores with private L1s and a shared L2) and a GPU cluster (compute units with private L1s and a shared L2) connected through a Directory to the DRAM channels]

Technology drivers
‒ Increasing memory BW
‒ Increasing integration (image credit: IBM)

System drivers: hUMA
‒ Shared virtual address space
‒ Cache coherence

Key bottlenecks in today's systems
‒ Directory BW
‒ Resource usage

Heterogeneous System Coherence (HSC)
‒ Leverage coarse-grained sharing between CPU & GPU
‒ Move coherent traffic onto direct-access bus
‒ Bandwidth ↓ 94%, resources ↓ 95%
‒ Average 2x speedup over conventional directory
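The coarse-grained sharing idea can be illustrated with a toy region-buffer model (not the authors' implementation; the region size here is an assumption): only the first touch of a region consults the directory for region permission, after which block accesses within that region bypass it over the direct-access bus.

```python
# Toy model of HSC's region-buffer filtering: the first access to a region
# obtains region permission from the region directory; later block accesses
# within that region bypass the directory and use the direct-access bus.
REGION_BYTES = 1024  # assumed region size, for illustration only

class RegionBuffer:
    def __init__(self):
        self.regions = set()          # regions we currently hold permission for
        self.directory_requests = 0   # traffic that reached the directory
        self.direct_accesses = 0      # traffic offloaded to the direct bus

    def access(self, block_addr):
        region = block_addr // REGION_BYTES
        if region in self.regions:
            self.direct_accesses += 1     # region hit: bypass the directory
        else:
            self.directory_requests += 1  # region miss: fetch region permission
            self.regions.add(region)

buf = RegionBuffer()
for addr in range(0, 4096, 64):  # stream through 4 regions of 64 B blocks
    buf.access(addr)
print(buf.directory_requests, buf.direct_accesses)  # 4 60
```

With this kind of streaming access pattern, 60 of 64 requests (93.75%) skip the directory, matching the ~94% ceiling the poster reports.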


Baseline Design

[Diagram: baseline L2 cache — demand requests check the cache tag arrays; hits return core data responses; misses allocate MSHR entries and go out through the coherent network interface, which also carries probe requests and data responses]

Heterogeneous System Coherence Design

[Diagram: HSC L2 cache — demand requests check the cache tag arrays; hits return core data responses; misses consult the Region Buffer, where region hits go out over the Direct Access Bus Interface and region misses allocate MSHR entries and use the coherent network interface, which also carries probe requests]

[Diagram: Directory — coherent block requests check the Block Directory Tag Array; misses allocate MSHR entries and go to DRAM; a Probe Request RAM (PR entries) handles block probe requests/responses]

[Diagram: Region Directory — region permission requests check the Region Directory Tag Array; misses allocate MSHR entries; a Probe Request RAM (PR entries) handles block probe requests/responses]

Contact: [email protected] bit.ly/JasonHSC