MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun  Vance Miller  Joshua Landgraf  Saugata Ghose  Jayneel Gandhi  Adwait Jog  Christopher J. Rossbach  Onur Mutlu

Page 1

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun  Vance Miller  Joshua Landgraf  Saugata Ghose

Jayneel Gandhi  Adwait Jog  Christopher J. Rossbach  Onur Mutlu

Page 2

Executive Summary

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a GPU memory hierarchy that
A. Reduces shared TLB contention

B. Improves L2 cache utilization

C. Reduces page walk latency

MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms

Large performance loss vs. no translation

High contention at the shared TLB

Low L2 cache utilization

Page 3

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 4

Why Share Discrete GPUs?

• Enables multiple GPGPU applications to run concurrently

• Better resource utilization
  - An application often cannot utilize an entire GPU
  - Different compute and bandwidth demands

• Enables GPU sharing in the cloud
  - Multiple users spatially share each GPU

Key requirement: fine-grained memory protection

Page 5

State-of-the-art Address Translation in GPUs

[Diagram: Each GPU core has a private TLB; all cores share an L2 TLB and the page table walkers; the page table resides in main memory. App 1 and App 2 run on separate cores.]

Page 6

A TLB Miss Stalls Multiple Warps

[Diagram: same shared-TLB hierarchy as Page 5, with warps stalled on a TLB miss]

Data in a page is shared by many threads

All threads access the same page

Page 7

Multiple Page Walks Happen Together

[Diagram: same hierarchy, with multiple cores stalled while their page walks proceed in parallel]

GPU's parallelism creates parallel page walks

Page 8–10

Effect of Translation on Performance

[Chart: Normalized performance of Ideal, SharedTLB, and PWCache; annotated performance loss: 37.4%]

What causes the large performance loss?

Page 11–12

Problem 1: Contention at the Shared TLB

• Multiple GPU applications contend for the TLB

[Chart: L2 TLB miss rate (lower is better), Alone vs. Shared, for App 1 and App 2 in the 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY workloads]

Contention at the shared TLB leads to lower performance

Page 13

Problem 2: Thrashing at the L2 Cache

• L2 cache can be used to reduce page walk latency
  - Partial translation data can be cached

• Thrashing Source 1: Parallel page walks
  - Different address translation data evicts each other

• Thrashing Source 2: GPU memory intensity
  - Demand-fetched data evicts address translation data

L2 cache is ineffective at reducing page walk latency

Page 14

Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page

[Chart: Warps stalled per one TLB miss, for each application and the average]

A single TLB miss causes 8 warps to stall on average

Page 15

Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page

• GPU’s parallelism causes multiple concurrent page walks

[Chart: Concurrent page walks for each application and the average]

High address translation latency → more stalled warps

Page 16

Our Goals

• Reduce shared TLB contention

• Improve L2 cache utilization

• Lower page walk latency

Page 17

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 18

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 19

A: TLB-fill Tokens

• Goal: Limit the number of warps that can fill the TLB
  - A warp with a token fills the shared TLB
  - A warp with no token fills a very small bypass cache

• Number of tokens changes based on TLB miss rate
  - Updated every epoch

• Tokens are assigned based on warp ID

Benefit: Limits contention at the shared TLB
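The token policy above can be sketched in software as follows. This is an illustrative model, not the paper's hardware design: the epoch length, the one-token-per-epoch adjustment step, and the lowest-warp-ID assignment rule are assumptions made for the sketch.

```python
class TLBFillTokens:
    """Illustrative sketch of TLB-fill Tokens: only warps holding a
    token may fill the shared TLB; the rest fill a small bypass cache."""

    def __init__(self, num_warps, epoch_len=100_000):
        self.num_warps = num_warps
        self.num_tokens = num_warps   # start: every warp may fill
        self.epoch_len = epoch_len
        self.accesses = 0
        self.misses = 0
        self.prev_miss_rate = 0.0

    def has_token(self, warp_id):
        # Assumed rule: tokens go to the warps with the lowest IDs.
        return warp_id < self.num_tokens

    def record_access(self, hit):
        self.accesses += 1
        if not hit:
            self.misses += 1
        if self.accesses >= self.epoch_len:
            self._end_epoch()

    def _end_epoch(self):
        miss_rate = self.misses / self.accesses
        # Assumed adjustment: if the shared-TLB miss rate grew this
        # epoch, hand out fewer tokens (less contention); if it
        # shrank, allow more warps to fill the shared TLB.
        if miss_rate > self.prev_miss_rate:
            self.num_tokens = max(1, self.num_tokens - 1)
        elif miss_rate < self.prev_miss_rate:
            self.num_tokens = min(self.num_warps, self.num_tokens + 1)
        self.prev_miss_rate = miss_rate
        self.accesses = self.misses = 0
```

After a high-miss epoch the token count shrinks, so fewer warps compete for shared TLB fills; warps without a token still get correct translations, just without polluting the shared structure.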

Page 20

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 21–24

B: Translation-aware L2 Bypass

• L2 hit rate decreases for deeper page walk levels

[Chart: L2 cache hit rate for Page Table Levels 1–4; the hit rate drops level by level]

Some address translation data does not benefit from caching

Only cache address translation data with a high hit rate

Page 25

B: Translation-aware L2 Bypass

• Goal: Cache address translation data with a high hit rate

[Chart: L2 cache hit rate per page table level vs. the average L2 cache hit rate; levels above the average are cached, levels below it bypass the L2]

Benefit 1: Better L2 cache utilization for translation data

Benefit 2: Bypassed requests incur no L2 queuing delay
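A minimal software model of the bypass decision above: track L2 hit rates per page table level and bypass any level whose hit rate falls below the overall average. The counter scheme and class names are illustrative assumptions, not the paper's hardware implementation.

```python
class TranslationAwareL2Bypass:
    """Sketch: per-page-table-level L2 hit-rate counters decide whether
    a page-walk request uses the L2 or goes straight to DRAM."""

    NUM_LEVELS = 4  # four-level page table, as on the slide

    def __init__(self):
        self.hits = [0] * self.NUM_LEVELS
        self.accesses = [0] * self.NUM_LEVELS

    def record(self, level, hit):
        # Called on each L2 access made by a page walk at `level`.
        self.accesses[level] += 1
        if hit:
            self.hits[level] += 1

    def _rate(self, level):
        a = self.accesses[level]
        return self.hits[level] / a if a else 1.0  # optimistic start

    def should_bypass(self, level):
        total = sum(self.accesses)
        if total == 0:
            return False  # no history yet: use the L2
        avg_rate = sum(self.hits) / total
        # Cache only translation data whose level hits at least as
        # often as the average; otherwise skip the L2 entirely,
        # which also avoids L2 queuing delay for the request.
        return self._rate(level) < avg_rate
```

In this sketch the deepest levels (which the chart shows hit rarely) quickly fall below the average and start bypassing, leaving L2 capacity for the translation data that actually benefits.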

Page 26

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 27

C: Address-space-aware Memory Scheduler

• Cause: Address translation requests are treated similarly to data demand requests

Idea: Lower address translation request latency

[Chart: DRAM latency and DRAM bandwidth utilization, broken down into address translation requests vs. data demand requests]

Page 28

C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests

[Diagram: Memory Scheduler feeding DRAM, with a high-priority Golden Queue for address translation requests and a low-priority Normal Queue for data demand requests]

Page 29

C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests

• Idea 2: Improve quality-of-service using the Silver Queue

[Diagram: Memory Scheduler feeding DRAM, with the Golden Queue (address translation requests, high priority), the Silver Queue (data demand requests, applications take turns), and the Normal Queue (data demand requests, low priority)]

Each application takes turns injecting into the Silver Queue
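The three-queue policy on this slide can be sketched as follows. The `silver_quota` knob and the dict-based request representation are illustrative assumptions; the sketch only captures the priority order and the turn-taking for the Silver Queue.

```python
from collections import deque

class AddressSpaceAwareScheduler:
    """Sketch of the three-queue policy: Golden (translation requests,
    highest priority), Silver (data requests, apps take turns for QoS),
    Normal (all remaining data requests, lowest priority)."""

    def __init__(self, app_ids, silver_quota=4):
        self.golden = deque()
        self.silver = deque()
        self.normal = deque()
        self.app_ids = list(app_ids)
        self.turn = 0               # whose turn for the Silver Queue
        self.quota = silver_quota   # requests per turn (assumed knob)
        self.used = 0

    def enqueue(self, req):
        # req: dict with 'app' and 'is_translation' keys (illustrative).
        if req["is_translation"]:
            self.golden.append(req)
        elif req["app"] == self.app_ids[self.turn] and self.used < self.quota:
            self.silver.append(req)
            self.used += 1
            if self.used == self.quota:  # quota spent: rotate the turn
                self.turn = (self.turn + 1) % len(self.app_ids)
                self.used = 0
        else:
            self.normal.append(req)

    def next_request(self):
        # Strict priority: Golden > Silver > Normal.
        for q in (self.golden, self.silver, self.normal):
            if q:
                return q.popleft()
        return None
```

Because page walks stall many warps at once, serving the Golden Queue first shortens walk latency, while the rotating Silver Queue keeps one application's demand traffic from starving another's.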

Page 30

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 31

Methodology

• Mosaic simulation platform [MICRO '17]
  - Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS '15]
  - Models page walks and virtual-to-physical mapping
  - Available at https://github.com/CMU-SAFARI/Mosaic

• NVIDIA GTX 750 Ti

• Two GPGPU applications execute concurrently

• CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 3 workload categories based on TLB miss rate

Page 32

Comparison Points

• State-of-the-art CPU–GPU memory management [Power et al., HPCA '14]
  - PWCache: Page Walk Cache GPU MMU design
  - SharedTLB: Shared TLB GPU MMU design

• Ideal: Every TLB access is an L1 TLB hit

Page 33

Performance

[Chart: Normalized performance of PWCache, SharedTLB, MASK, and Ideal for the 0-HMR, 1-HMR, and 2-HMR workload categories and the average; annotated MASK improvements: 57.8%, 52.0%, 61.2%, 58.7%]

MASK outperforms state-of-the-art design for every workload

Page 34

Other Results in the Paper

• MASK reduces unfairness

• Effectiveness of each individual component
  - All three MASK components are effective

• Sensitivity analysis over multiple GPU architectures
  - MASK improves performance on all evaluated architectures, including CPU–GPU heterogeneous systems

• Sensitivity analysis to different TLB sizes
  - MASK improves performance on all evaluated sizes

• Performance improvement over different memory scheduling policies
  - MASK improves performance over other state-of-the-art memory schedulers

Page 35

Conclusion

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a translation-aware GPU memory hierarchy
A. TLB-fill Tokens reduces shared TLB contention

B. Translation-aware L2 Bypass improves L2 cache utilization

C. Address-space-aware Memory Scheduler reduces page walk latency

MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms

Large performance loss vs. no translation

High contention at the shared TLB

Low L2 cache utilization

Page 36

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose

Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu

Page 37

Backup Slides

Page 38

Other Ways to Manage TLB Contention

• Prefetching:
  - Stream prefetcher is ineffective for multiple workloads
  - GPU's parallelism makes it hard to predict which translation data to prefetch
  - GPU's parallelism causes thrashing on the prefetched data

• Reuse-based techniques:
  - Lower the TLB hit rate
  - Most pages have similar TLB hit rates

Page 39

Other Ways to Manage L2 Thrashing

• Cache partitioning:
  - Performs ~3% worse on average compared to Translation-aware L2 Bypass
  - Multiple address translation requests still thrash each other
  - Can lead to underutilization
  - Lowers the hit rate of data requests

• Cache insertion policy:
  - Does not yield a better hit rate for lower page table levels
  - Does not benefit from lower queuing latency

Page 40

Utilizing Large Pages?

• One single large page size:
  - High demand paging latency
  - >90% performance overhead with demand paging
  - All threads stall during large page PCIe transfer

• Mosaic [Ausavarungnirun et al., MICRO '17]:
  - Supports multiple page sizes
  - Demand paging happens at small page granularity
  - Allocates data from the same application at large page granularity
  - Opportunistically coalesces small pages to reduce TLB contention

MASK + Mosaic performs within 5% of the Ideal TLB

Page 41

Area and Storage Overhead

• Area overhead:
  - <1% of the area of its original components (shared TLB, L2 cache, memory scheduler)

• Storage overhead:
  - TLB-fill Tokens: 3.8% extra storage in the shared TLB
  - Translation-aware L2 Bypass: 0.1% extra storage in the L2 cache
  - Address-space-aware Memory Scheduler: 6% extra memory request buffer

Page 42

Unfairness

[Chart: Unfairness (lower is better) of PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotated values: 20.1%, 25.0%, 21.8%, 22.4%]

MASK is effective at improving fairness

Page 43

Performance vs. Other Baselines

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, MASK, and Ideal for 0-HMR, 1-HMR, 2-HMR, and the average]

Page 44

Unfairness

[Chart: Unfairness of Static, PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotated values: 22.4%, 21.8%, 25.0%, 20.1%]

Page 45

DRAM Utilization Breakdowns

[Chart: Normalized DRAM bandwidth utilization per workload, broken down into address translation requests vs. data demand requests]

Page 46

DRAM Latency Breakdowns

[Chart: DRAM latency (cycles) per workload, broken down into address translation requests vs. data demand requests]

Page 47

0-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 0-HMR workloads HISTO_GUP, HISTO_LPS, NW_HS, NW_LPS, RAY_GUP, RAY_HS, SCP_GUP, SCP_HS]

Page 48

1-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 1-HMR workloads]

Page 49

2-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 2-HMR workloads]

Page 50

Additional Baseline Performance

[Chart: Normalized performance of PWCache, SharedTLB, and Ideal; annotated value: 45.6%]