MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun  Vance Miller  Joshua Landgraf  Saugata Ghose  Jayneel Gandhi  Adwait Jog  Christopher J. Rossbach  Onur Mutlu

Page 1

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun  Vance Miller  Joshua Landgraf  Saugata Ghose

Jayneel Gandhi  Adwait Jog  Christopher J. Rossbach  Onur Mutlu

Page 2

Executive Summary

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a GPU memory hierarchy that
A. Reduces shared TLB contention

B. Improves L2 cache utilization

C. Reduces page walk latency

MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms

Large performance loss vs. no translation

High contention at the shared TLB

Low L2 cache utilization

Page 3

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 4

Why Share Discrete GPUs?

• Enables multiple GPGPU applications to run concurrently

• Better resource utilization
  - An application often cannot utilize an entire GPU
  - Different compute and bandwidth demands

• Enables GPU sharing in the cloud
  - Multiple users spatially share each GPU

Key requirement: fine-grained memory protection

Page 5

State-of-the-art Address Translation in GPUs

[Diagram: Each GPU core has a private TLB; all cores share an L2 TLB and the page table walkers; the page table resides in main memory. App 1 and App 2 run on separate cores.]

Page 6

A TLB Miss Stalls Multiple Warps

[Diagram: same shared-TLB hierarchy as Page 5, with warps stalled on a TLB miss]

Data in a page is shared by many threads

All threads access the same page

Page 7

Multiple Page Walks Happen Together

[Diagram: same hierarchy, with multiple cores stalled while their page walks proceed in parallel]

GPU's parallelism creates parallel page walks

Page 8–10

Effect of Translation on Performance

[Chart: Normalized performance of Ideal, SharedTLB, and PWCache; annotated performance loss: 37.4%]

What causes the large performance loss?

Page 11–12

Problem 1: Contention at the Shared TLB

• Multiple GPU applications contend for the TLB

[Chart: L2 TLB miss rate (lower is better), Alone vs. Shared, for App 1 and App 2 in the 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY workloads]

Contention at the shared TLB leads to lower performance

Page 13

Problem 2: Thrashing at the L2 Cache

• L2 cache can be used to reduce page walk latency
  - Partial translation data can be cached

• Thrashing Source 1: Parallel page walks
  - Different address translation data evicts each other

• Thrashing Source 2: GPU memory intensity
  - Demand-fetched data evicts address translation data

L2 cache is ineffective at reducing page walk latency

Page 14

Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page

[Chart: Warps stalled per one TLB miss, for each application and the average]

A single TLB miss causes 8 warps to stall on average

Page 15

Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page

• GPU’s parallelism causes multiple concurrent page walks

[Chart: Concurrent page walks for each application and the average]

High address translation latency → more stalled warps

Page 16

Our Goals

• Reduce shared TLB contention

• Improve L2 cache utilization

• Lower page walk latency

Page 17

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 18

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 19

A: TLB-fill Tokens

• Goal: Limit the number of warps that can fill the TLB
  - A warp with a token fills the shared TLB
  - A warp with no token fills a very small bypass cache

• Number of tokens changes based on TLB miss rate
  - Updated every epoch

• Tokens are assigned based on warp ID

Benefit: Limits contention at the shared TLB
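The token policy above can be sketched in software as follows. This is an illustrative model, not the paper's hardware design: the epoch length, the one-token-per-epoch adjustment step, and the lowest-warp-ID assignment rule are assumptions made for the sketch.

```python
class TLBFillTokens:
    """Illustrative sketch of TLB-fill Tokens: only warps holding a
    token may fill the shared TLB; the rest fill a small bypass cache."""

    def __init__(self, num_warps, epoch_len=100_000):
        self.num_warps = num_warps
        self.num_tokens = num_warps   # start: every warp may fill
        self.epoch_len = epoch_len
        self.accesses = 0
        self.misses = 0
        self.prev_miss_rate = 0.0

    def has_token(self, warp_id):
        # Assumed rule: tokens go to the warps with the lowest IDs.
        return warp_id < self.num_tokens

    def record_access(self, hit):
        self.accesses += 1
        if not hit:
            self.misses += 1
        if self.accesses >= self.epoch_len:
            self._end_epoch()

    def _end_epoch(self):
        miss_rate = self.misses / self.accesses
        # Assumed adjustment: if the shared-TLB miss rate grew this
        # epoch, hand out fewer tokens (less contention); if it
        # shrank, allow more warps to fill the shared TLB.
        if miss_rate > self.prev_miss_rate:
            self.num_tokens = max(1, self.num_tokens - 1)
        elif miss_rate < self.prev_miss_rate:
            self.num_tokens = min(self.num_warps, self.num_tokens + 1)
        self.prev_miss_rate = miss_rate
        self.accesses = self.misses = 0
```

After a high-miss epoch the token count shrinks, so fewer warps compete for shared TLB fills; warps without a token still get correct translations, just without polluting the shared structure.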

Page 20

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 21–24

B: Translation-aware L2 Bypass

• L2 hit rate decreases for deeper page walk levels

[Chart: L2 cache hit rate for Page Table Levels 1–4; the hit rate drops level by level]

Some address translation data does not benefit from caching

Only cache address translation data with a high hit rate

Page 25

B: Translation-aware L2 Bypass

• Goal: Cache address translation data with a high hit rate

[Chart: L2 cache hit rate per page table level vs. the average L2 cache hit rate; levels above the average are cached, levels below it bypass the L2]

Benefit 1: Better L2 cache utilization for translation data

Benefit 2: Bypassed requests incur no L2 queuing delay
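A minimal software model of the bypass decision above: track L2 hit rates per page table level and bypass any level whose hit rate falls below the overall average. The counter scheme and class names are illustrative assumptions, not the paper's hardware implementation.

```python
class TranslationAwareL2Bypass:
    """Sketch: per-page-table-level L2 hit-rate counters decide whether
    a page-walk request uses the L2 or goes straight to DRAM."""

    NUM_LEVELS = 4  # four-level page table, as on the slide

    def __init__(self):
        self.hits = [0] * self.NUM_LEVELS
        self.accesses = [0] * self.NUM_LEVELS

    def record(self, level, hit):
        # Called on each L2 access made by a page walk at `level`.
        self.accesses[level] += 1
        if hit:
            self.hits[level] += 1

    def _rate(self, level):
        a = self.accesses[level]
        return self.hits[level] / a if a else 1.0  # optimistic start

    def should_bypass(self, level):
        total = sum(self.accesses)
        if total == 0:
            return False  # no history yet: use the L2
        avg_rate = sum(self.hits) / total
        # Cache only translation data whose level hits at least as
        # often as the average; otherwise skip the L2 entirely,
        # which also avoids L2 queuing delay for the request.
        return self._rate(level) < avg_rate
```

In this sketch the deepest levels (which the chart shows hit rarely) quickly fall below the average and start bypassing, leaving L2 capacity for the translation data that actually benefits.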

Page 26

MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens

• Improve L2 cache utilization → B. Translation-aware L2 Bypass

• Lower page walk latency → C. Address-space-aware Memory Scheduler

Page 27

C: Address-space-aware Memory Scheduler

• Cause: Address translation requests are treated similarly to data demand requests

Idea: Lower address translation request latency

[Chart: DRAM latency and DRAM bandwidth utilization, broken down into address translation requests vs. data demand requests]

Page 28

C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests

[Diagram: Memory Scheduler feeding DRAM, with a high-priority Golden Queue for address translation requests and a low-priority Normal Queue for data demand requests]

Page 29

C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests

• Idea 2: Improve quality-of-service using the Silver Queue

[Diagram: Memory Scheduler feeding DRAM, with the Golden Queue (address translation requests, high priority), the Silver Queue (data demand requests, applications take turns), and the Normal Queue (data demand requests, low priority)]

Each application takes turns injecting into the Silver Queue
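The three-queue policy on this slide can be sketched as follows. The `silver_quota` knob and the dict-based request representation are illustrative assumptions; the sketch only captures the priority order and the turn-taking for the Silver Queue.

```python
from collections import deque

class AddressSpaceAwareScheduler:
    """Sketch of the three-queue policy: Golden (translation requests,
    highest priority), Silver (data requests, apps take turns for QoS),
    Normal (all remaining data requests, lowest priority)."""

    def __init__(self, app_ids, silver_quota=4):
        self.golden = deque()
        self.silver = deque()
        self.normal = deque()
        self.app_ids = list(app_ids)
        self.turn = 0               # whose turn for the Silver Queue
        self.quota = silver_quota   # requests per turn (assumed knob)
        self.used = 0

    def enqueue(self, req):
        # req: dict with 'app' and 'is_translation' keys (illustrative).
        if req["is_translation"]:
            self.golden.append(req)
        elif req["app"] == self.app_ids[self.turn] and self.used < self.quota:
            self.silver.append(req)
            self.used += 1
            if self.used == self.quota:  # quota spent: rotate the turn
                self.turn = (self.turn + 1) % len(self.app_ids)
                self.used = 0
        else:
            self.normal.append(req)

    def next_request(self):
        # Strict priority: Golden > Silver > Normal.
        for q in (self.golden, self.silver, self.normal):
            if q:
                return q.popleft()
        return None
```

Because page walks stall many warps at once, serving the Golden Queue first shortens walk latency, while the rotating Silver Queue keeps one application's demand traffic from starving another's.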

Page 30

Outline

• Executive Summary

• Background, Key Challenges and Our Goal

• MASK: A Translation-aware Memory Hierarchy

• Evaluation

• Conclusion

Page 31

Methodology

• Mosaic simulation platform [MICRO '17]
  - Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS '15]
  - Models page walks and virtual-to-physical mapping
  - Available at https://github.com/CMU-SAFARI/Mosaic

• NVIDIA GTX 750 Ti

• Two GPGPU applications execute concurrently

• CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 3 workload categories based on TLB miss rate

Page 32

Comparison Points

• State-of-the-art CPU–GPU memory management [Power et al., HPCA '14]
  - PWCache: Page Walk Cache GPU MMU design
  - SharedTLB: Shared TLB GPU MMU design

• Ideal: Every TLB access is an L1 TLB hit

Page 33

Performance

[Chart: Normalized performance of PWCache, SharedTLB, MASK, and Ideal for the 0-HMR, 1-HMR, and 2-HMR workload categories and the average; annotated MASK improvements: 57.8%, 52.0%, 61.2%, 58.7%]

MASK outperforms state-of-the-art design for every workload

Page 34

Other Results in the Paper

• MASK reduces unfairness

• Effectiveness of each individual component
  - All three MASK components are effective

• Sensitivity analysis over multiple GPU architectures
  - MASK improves performance on all evaluated architectures, including CPU–GPU heterogeneous systems

• Sensitivity analysis to different TLB sizes
  - MASK improves performance on all evaluated sizes

• Performance improvement over different memory scheduling policies
  - MASK improves performance over other state-of-the-art memory schedulers

Page 35

Conclusion

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a translation-aware GPU memory hierarchy
A. TLB-fill Tokens reduces shared TLB contention

B. Translation-aware L2 Bypass improves L2 cache utilization

C. Address-space-aware Memory Scheduler reduces page walk latency

MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms

Large performance loss vs. no translation

High contention at the shared TLB

Low L2 cache utilization

Page 36

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose

Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu

Page 37

Backup Slides

Page 38

Other Ways to Manage TLB Contention

• Prefetching:
  - Stream prefetcher is ineffective for multiple workloads
  - GPU's parallelism makes it hard to predict which translation data to prefetch
  - GPU's parallelism causes thrashing on the prefetched data

• Reuse-based techniques:
  - Lower the TLB hit rate
  - Most pages have similar TLB hit rates

Page 39

Other Ways to Manage L2 Thrashing

• Cache partitioning:
  - Performs ~3% worse on average compared to Translation-aware L2 Bypass
  - Multiple address translation requests still thrash each other
  - Can lead to underutilization
  - Lowers the hit rate of data requests

• Cache insertion policy:
  - Does not yield a better hit rate for lower page table levels
  - Does not benefit from lower queuing latency

Page 40

Utilizing Large Pages?

• One single large page size:
  - High demand paging latency
  - >90% performance overhead with demand paging
  - All threads stall during large page PCIe transfer

• Mosaic [Ausavarungnirun et al., MICRO '17]:
  - Supports multiple page sizes
  - Demand paging happens at small page granularity
  - Allocates data from the same application at large page granularity
  - Opportunistically coalesces small pages to reduce TLB contention

MASK + Mosaic performs within 5% of the Ideal TLB

Page 41

Area and Storage Overhead

• Area overhead:
  - <1% of the area of its original components (shared TLB, L2 cache, memory scheduler)

• Storage overhead:
  - TLB-fill Tokens: 3.8% extra storage in the shared TLB
  - Translation-aware L2 Bypass: 0.1% extra storage in the L2 cache
  - Address-space-aware Memory Scheduler: 6% extra memory request buffer

Page 42

Unfairness

[Chart: Unfairness (lower is better) of PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotated values: 20.1%, 25.0%, 21.8%, 22.4%]

MASK is effective at improving fairness

Page 43

Performance vs. Other Baselines

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, MASK, and Ideal for 0-HMR, 1-HMR, 2-HMR, and the average]

Page 44

Unfairness

[Chart: Unfairness of Static, PWCache, SharedTLB, and MASK for 0-HMR, 1-HMR, 2-HMR, and the average; annotated values: 22.4%, 21.8%, 25.0%, 20.1%]

Page 45

DRAM Utilization Breakdowns

[Chart: Normalized DRAM bandwidth utilization per workload, broken down into address translation requests vs. data demand requests]

Page 46

DRAM Latency Breakdowns

[Chart: DRAM latency (cycles) per workload, broken down into address translation requests vs. data demand requests]

Page 47

0-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 0-HMR workloads HISTO_GUP, HISTO_LPS, NW_HS, NW_LPS, RAY_GUP, RAY_HS, SCP_GUP, SCP_HS]

Page 48

1-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 1-HMR workloads]

Page 49

2-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 2-HMR workloads]

Page 50

Additional Baseline Performance

[Chart: Normalized performance of PWCache, SharedTLB, and Ideal; annotated value: 45.6%]