
Heterogeneous System Coherence for Integrated CPU-GPU Systems

Jason Power*, Arkaprava Basu*, Junli Gu†, Sooraj Puthoor†, Bradford M Beckmann†, Mark D Hill*†, Steven K Reinhardt†, David A Wood*†

Heterogeneous System Bottlenecks

[Figure: three bar charts over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm —
Bandwidth: directory accesses per GPU cycle (0–5);
Resource Usage: maximum MSHRs (log scale, 1–100,000);
Performance with Constrained Resources: slow-down (0–5)]

Acknowledgements

This work was performed at AMD Research, including student internships. Wisconsin authors improved the paper's presentation while being partially supported by NSF grants CCF-0916725, SHF-1017650, CNS-1117280, and CCF-1218323. The views expressed herein are not necessarily those of the NSF. Professors Hill and Wood have significant financial interests in AMD.

Conclusions

Hardware coherence can increase the utility of heterogeneous systems.

Major bottlenecks in current coherence implementations:
‒ High BW difficult to support at directory
‒ Extreme resource requirements

We propose Heterogeneous System Coherence:
‒ Leverages spatial locality and region coherence
‒ Reduces bandwidth by 94%
‒ Reduces resource requirements by 95%

Come to our talk, Session 7, 11am on Wednesday!

Discussion

[Figure: two bar charts over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm —
HSC Load Latency: normalized load latency (0–5) for broadcast, baseline, and HSC;
HSC Resource Usage: normalized MSHRs required by HSC (0–0.25)]

HSC reduces the required MSHRs at the directory:
‒ By more than 4x in all cases
‒ By more than 20x in many cases

HSC reduces the average load latency.

Methodology

CPU Clock                2 GHz
CPU Cores                2
CPU Shared L2 Cache      2 MB
GPU Clock                1 GHz
Compute Units            32
GPU L1 Data Cache        32 KB
GPU Shared L2 Cache      4 MB
L3 Memory-side Cache     16 MB
Peak Memory Bandwidth    700 GB/s
Baseline Directory       262,144 entries
Region Directory         32,768 entries
MSHRs                    32 entries
Region Buffer            16,384 entries
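The entry counts in the table above give a quick sense of the directory-footprint difference; a minimal sketch (per-entry sizes are not stated on the poster, so only entry counts are compared):

```python
# Back-of-the-envelope comparison of directory tracking structures from the
# methodology table: the baseline tracks individual blocks, HSC tracks regions.
BASELINE_DIRECTORY_ENTRIES = 262_144   # block-grained entries
REGION_DIRECTORY_ENTRIES = 32_768      # region-grained entries (HSC)

def entry_reduction(baseline_entries, region_entries):
    """Factor by which HSC shrinks the number of directory entries."""
    return baseline_entries / region_entries

print(entry_reduction(BASELINE_DIRECTORY_ENTRIES, REGION_DIRECTORY_ENTRIES))  # 8.0
```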

Results Summary

[Bar chart: normalized speedup (0–5) over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm for Broadcast, Baseline, and HSC]

HSC Performance

Largest speedup for the workloads that constrained resources hurt the most.

Massive bandwidth reduction:
‒ Due to offloading data onto the direct-access bus
‒ More than the theoretical max of 94% in some cases
‒ Region buffers can "prefetch" cache permissions

HSC significantly improves performance over the baseline design.

Decreases the bandwidth requirement of the directory.
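The 94% theoretical ceiling follows from the region-to-block ratio; a sketch assuming 64 B cache blocks and 1 KB regions (16 blocks per region — these sizes are assumptions for illustration, the poster only states the resulting percentage):

```python
# If only the first access in each region must visit the directory (the region
# permission fetch) and the remaining block requests bypass it over the
# direct-access bus, the directory-traffic reduction ceiling is 1 - 1/N for
# N blocks per region.
BLOCK_BYTES = 64      # assumed cache-block size
REGION_BYTES = 1024   # assumed region size

def max_traffic_reduction(region_bytes, block_bytes):
    """Fraction of block requests that can skip the directory."""
    blocks_per_region = region_bytes // block_bytes
    return 1 - 1 / blocks_per_region

print(f"{max_traffic_reduction(REGION_BYTES, BLOCK_BYTES):.0%}")  # 94%
```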

Simulation
‒ gem5 for CPU and memory system
‒ Ruby for caches
‒ GPU modeled off of GCN

Workloads
‒ Subset of Rodinia
‒ AMD APP SDK

Results

HSC Bandwidth

[Bar chart: normalized bandwidth (0–1.2) over workloads bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm for broadcast, baseline, and HSC]

Summary

[Diagram: APU — a CPU cluster (four CPU cores with private L1s and a shared L2) and a GPU cluster (compute units with private L1s and a shared L2) connected through a Directory to the DRAM channels]

Technology drivers
‒ Increasing memory BW
‒ Increasing integration (image credit: IBM)

System drivers: hUMA
‒ Shared virtual address space
‒ Cache coherence

Key bottlenecks in today's systems
‒ Directory BW
‒ Resource usage

Heterogeneous System Coherence (HSC)
‒ Leverage coarse-grained sharing between CPU & GPU
‒ Move coherent traffic onto direct-access bus
‒ Bandwidth ↓ 94%, resources ↓ 95%
‒ Average 2x speedup over conventional directory
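The coarse-grained sharing idea can be illustrated with a toy region-buffer model (not the authors' implementation; the region size here is an assumption): only the first touch of a region consults the directory for region permission, after which block accesses within that region bypass it over the direct-access bus.

```python
# Toy model of HSC's region-buffer filtering: the first access to a region
# obtains region permission from the region directory; later block accesses
# within that region bypass the directory and use the direct-access bus.
REGION_BYTES = 1024  # assumed region size, for illustration only

class RegionBuffer:
    def __init__(self):
        self.regions = set()          # regions we currently hold permission for
        self.directory_requests = 0   # traffic that reached the directory
        self.direct_accesses = 0      # traffic offloaded to the direct bus

    def access(self, block_addr):
        region = block_addr // REGION_BYTES
        if region in self.regions:
            self.direct_accesses += 1     # region hit: bypass the directory
        else:
            self.directory_requests += 1  # region miss: fetch region permission
            self.regions.add(region)

buf = RegionBuffer()
for addr in range(0, 4096, 64):  # stream through 4 regions of 64 B blocks
    buf.access(addr)
print(buf.directory_requests, buf.direct_accesses)  # 4 60
```

With this kind of streaming access pattern, 60 of 64 requests (93.75%) skip the directory, matching the ~94% ceiling the poster reports.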


Baseline Design

[Diagram: baseline L2 cache — demand requests check the cache tag arrays; hits return core data responses; misses allocate MSHR entries and go out through the coherent network interface, which also carries probe requests and data responses]

Heterogeneous System Coherence Design

[Diagram: HSC L2 cache — demand requests check the cache tag arrays; hits return core data responses; misses consult the Region Buffer, where region hits go out over the Direct Access Bus Interface and region misses allocate MSHR entries and use the coherent network interface, which also carries probe requests]

[Diagram: Directory — coherent block requests check the Block Directory Tag Array; misses allocate MSHR entries and go to DRAM; a Probe Request RAM (PR entries) handles block probe requests/responses]

[Diagram: Region Directory — region permission requests check the Region Directory Tag Array; misses allocate MSHR entries; a Probe Request RAM (PR entries) handles block probe requests/responses]

Contact: [email protected] bit.ly/JasonHSC