
Using Dead Blocks as a Virtual Victim Cache

Samira Khan, Daniel A. Jiménez, Doug Burger, Babak Falsafi

The Cache Utilization Wall

Performance gap: processors are getting faster, while memory is only getting larger.

Caches are not efficient: designed for fast lookup, they contain too many useless blocks!

We want the cache to be as efficient as possible

Cache Problem: Dead Blocks

Live block: will be referenced again before eviction
Dead block: dead from its last reference until it is evicted

[Figure: lifetime of a block in a cache set (MRU to LRU): fill, hit, hit, hit, last hit, eviction; the block is live until its last hit, then dead]

Cache blocks are dead on average 59% of the time

Reducing Dead Blocks: Virtual Victim Cache

[Figure: cache sets, MRU to LRU, showing live, dead, and victim blocks]

Dead blocks all over the cache act as a victim cache: put victim blocks in the dead blocks

Contributions: Virtual Victim Cache; skewed dead block predictor; victim placement and lookup

Results: improves predictor accuracy by 4.7%, reduces miss rate by 26%, improves performance by 12.1%

Outline: Introduction, Virtual Victim Cache, Methodology, Results, Conclusion

Virtual Victim Cache

Goal: use dead blocks to hold victim blocks

Mechanisms required: 1. identify which blocks are dead; 2. look up the victims

Different Dead Block Predictors

Counting based [ICCD05]: predicts a block dead after a certain number of accesses
Time based [ISCA02]: predicts a block dead after a certain number of cycles
Trace based [ISCA01]: predicts the last touch based on the PC
Cache burst based [MICRO08]: predicts a block dead when it moves out of the MRU position

Trace-Based Dead Block Predictor [ISCA 01]

Predicts the last touch based on the sequence of instructions that referenced the block

Encoding: truncated addition of the instruction PCs, called the signature

The predictor table is indexed by the signature; each entry is a 2-bit saturating counter
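The mechanism above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the class and method names, the threshold, and the training policy are assumptions; only the signature-by-truncated-addition idea, the table size, and the 2-bit counters come from the slides.

```python
SIG_BITS = 15                       # 2**15 = 32768 entries, matching the methodology table
TABLE_SIZE = 1 << SIG_BITS

class TracePredictor:
    """Illustrative trace-based dead block predictor (names are assumptions)."""

    def __init__(self):
        self.table = [0] * TABLE_SIZE      # 2-bit saturating counters

    def update_signature(self, sig, pc):
        # Truncated addition of the next referencing instruction's PC.
        return (sig + pc) & (TABLE_SIZE - 1)

    def is_dead(self, sig, threshold=2):
        # Predict dead once the counter reaches the confidence threshold.
        return self.table[sig] >= threshold

    def train(self, sig, block_was_dead):
        # On eviction, the block's final signature is trained toward "dead";
        # a reference that proves the block live trains the counter back down.
        if block_was_dead:
            self.table[sig] = min(3, self.table[sig] + 1)
        else:
            self.table[sig] = max(0, self.table[sig] - 1)
```

Because the same instruction sequence tends to lead to the same last touch, a signature that was followed by an eviction once is likely to be followed by an eviction again.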

Trace-Based Dead Block Predictor [ISCA 01]: example

Instruction sequence: PC1: ld a (fill), PC2: st b, PC3: ld a (hit), PC4: st a (hit), PC5: ld a (hit), PC6: ld e, PC7: ld f, PC8: st a (hit, last touch)

Only the PCs that touch block a enter its signature: signature = <PC1, PC3, PC4, PC5, PC8>

On eviction, the predictor table entry for this signature is trained to 1 (dead)

Skewed Trace Predictor

Reference trace predictor: index = hash(signature); predict dead if confidence >= threshold

Skewed trace predictor: index1 = hash1(signature), index2 = hash2(signature); predict dead if conf1 + conf2 >= threshold

Skewed Trace Predictor

Uses two different hash functions: reduces conflicts and improves accuracy

[Figure: signatures sigX and sigY conflict in the first table, hash1(sigX) = hash1(sigY), but hash2 maps them to different entries in the second table]

A conflict in both tables is much less likely
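The skewed variant can be sketched as two tables indexed by two different hashes of the same signature, with the prediction made on the summed confidence. The particular hash mixing below is invented for illustration; the slides do not give the actual hash functions.

```python
SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1

def hash1(sig):
    # First index: low signature bits (illustrative choice).
    return sig & MASK

def hash2(sig):
    # Second index: a different mixing of the same bits (illustrative choice).
    return ((sig >> 7) ^ (sig * 0x9E37)) & MASK

class SkewedPredictor:
    """Illustrative skewed trace predictor: two tables, summed confidence."""

    def __init__(self):
        self.t1 = [0] * (1 << SIG_BITS)
        self.t2 = [0] * (1 << SIG_BITS)

    def is_dead(self, sig, threshold=2):
        # A conflict in a single table cannot by itself flip the prediction,
        # because both confidences must jointly clear the threshold.
        return self.t1[hash1(sig)] + self.t2[hash2(sig)] >= threshold

    def train(self, sig, block_was_dead):
        for table, idx in ((self.t1, hash1(sig)), (self.t2, hash2(sig))):
            if block_was_dead:
                table[idx] = min(3, table[idx] + 1)
            else:
                table[idx] = max(0, table[idx] - 1)
```

Two signatures that collide under hash1 are very unlikely to also collide under hash2, so destructive interference between unrelated traces is largely filtered out.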

Victim Placement and Lookup in VVC

Place victims in dead blocks of adjacent sets. Any victim could be placed in any set, but then every set would have to be searched on a miss: a trade-off between the number of sets and lookup latency. We use only one adjacent set to minimize lookup latency.

How to determine the adjacent set? A set whose index differs by only one bit, far enough away not to be a hot set.

[Figure: cache, MRU to LRU, showing the original set and its adjacent set]
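Computing the adjacent set is then just a one-bit flip of the set index. The bit position is a parameter here for illustration; the extra slides state that the 4th bit is used.

```python
def adjacent_set(set_index, bit=4):
    """Partner set whose index differs from set_index in exactly one bit."""
    return set_index ^ (1 << bit)
```

The mapping is symmetric: flipping the same bit twice returns the original index, so every pair of sets are mutual partners, and each set's victims always land in one fixed, easily computed place.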

Victim Lookup

On a miss, search the adjacent set; if the block is found there, bring it back to its original set.

[Figure: lookup flow: search the original set (miss), search the adjacent set (hit), move the block back to the original set]

Virtual Victim Cache: Why it Works

Reduces conflict misses: provides extra associativity to hot sets.

Reduces capacity misses: puts the LRU block into a dead block (the block a fully associative cache would have replaced). Increasing the number of live blocks effectively increases capacity.

Robust to false positive predictions: if a live block is displaced by mistake, the VVC finds it in the adjacent set and avoids the miss.

Outline: Introduction, Virtual Victim Cache, Methodology, Results, Conclusion

Experimental Methodology

Issue width: 4
L1 I cache: 64KB, 2-way LRU, 64B blocks, 1 cycle hit
L1 D cache: 64KB, 2-way LRU, 64B blocks, 3 cycle hit
L2 cache: 2MB, 16-way LRU, 64B blocks, 12 cycle hit
Main memory: 270 cycles
Cores: 4
Trace encoding: 15 bits
Predictor table entries: 32768
Predictor entry: 2 bits

Simulator: modified version of SimpleScalar
Benchmarks: SPEC CPU2000 and SPEC CPU2006

Single Thread Speedup

[Figure: speedup over the baseline for a fully associative LRU cache and for the baseline + 64KB victim cache; y-axis 0.95 to 1.2, with off-scale values 2.6, 1.3, 1.7, 1.7, 1.3, 0.9]

A fully associative cache and a 64KB victim cache are both unrealistic designs

Single Thread Speedup

[Figure: speedup over the baseline; series include the dynamic insertion policy; y-axis 0.9 to 1.2, with off-scale values 1.2, 1.6, 2.6, 1.4, 1.7]

The accuracy of the predictor is more important in dead block replacement

Speedup for Multiple Threads

[Figure: speedup for the baseline + 64KB victim cache, the dynamic insertion policy, and dead block replacement; y-axis 0.9 to 1.3, with some values at 0.88, 0.88, 0.89, 0.84]

Blocks become less predictable in the presence of multiple threads

Tag Array Reads due to VVC

[Figure: tag array reads (x10M) per benchmark (175.vpr, 178.galgel, 179.art, 181.mcf, 187.facerec, 188.ammp, 197.parser, 255.vortex, 256.bzip2, 300.twolf, 401.bzip2, 429.mcf, 450.soplex, 456.hmmer, 464.h264ref, 473.astar, and their arithmetic mean) for the baseline 16-way 2MB LRU cache and the VVC; y-axis 0 to 45]

Tag array reads in the baseline cache are 3.9% of the total number of instructions executed, versus 4.9% for the VVC

Conclusion

The skewed predictor improves accuracy by 4.7%.
The Virtual Victim Cache achieves a 12.1% speedup for single-threaded workloads and a 4% speedup for multi-threaded workloads.

Future work in dead block prediction: improve accuracy, reduce overhead

Thank you

Extra slides

Dead Blocks as a Virtual Victim Cache: placing victim blocks into the adjacent set

Evicted blocks are placed in an invalid or predicted-dead block of the adjacent set; if no such block is present, the victim is placed in the LRU block. The receiver block is then moved to the MRU position. Adaptive insertion is also used.

Cache lookup for a previously evicted block: the original set lookup misses and the adjacent set lookup hits; the block is refilled from the adjacent set into the original set, and the receiver block in the adjacent set is marked invalid. One bit per block keeps track of receiver blocks, and tag matches in ordinary accesses to the original set ignore receiver blocks.

Reduction in Cache Area

[Figure: harmonic mean IPC vs. cache capacity for the baseline cache and the Virtual Victim Cache; y-axis 0.3 to 0.8]

Predictor Coverage and False Positive Rate

[Figure: skewed predictor coverage and false positive rate, as a percentage of L2 accesses, per SPEC benchmark and their arithmetic mean; y-axis 0 to 60]

Trace-Based Dead Block Predictor: training example

Memory instruction sequence going to cache set s: pc m: ld a, pc n: ld a, pc o: st a, pc p: ld b, pc q: ld c, pc r: st d, pc s: ld e, pc t: ld f, pc u: ld g, pc v: ld h, pc w: ld i

Fill action: block a is filled, its signature initialized to m, and the predictor entry for <signature m> (currently 0, not dead) is consulted.
Hit actions: each hit updates the signature by truncated addition (m+n, then m+n+o) and updates the predictor.
Evict action: on eviction, the predictor entry for the final signature <m+n+o> is trained to 1 (dead), so a future block reaching this signature can be predicted dead.

MPKI

[Figure: L2 misses per thousand instructions, per SPEC benchmark and arithmetic mean, for the baseline 16-way 2MB LRU cache, fully associative LRU cache, baseline + 64KB victim cache, dynamic insertion policy, dead block replacement, virtual victim cache, and fully associative cache with optimal replacement; y-axis 0 to 60]

IPC

[Figure: IPC per SPEC benchmark for the baseline 16-way 2MB LRU cache, the fully associative LRU cache, and the baseline + 64KB victim cache; y-axis 0 to 3]

Speedup

[Figure: speedup for the fully associative LRU cache, baseline + 64KB victim cache, dynamic insertion policy, dead block replacement, and virtual victim cache; y-axis 0.9 to 1.7, with off-scale values 2.5, 2.6, 2.6]


False Positive Prediction

[Figure: percentage of false positive predictions; y-axis 0 to 25]

Shared cache contention results in more false positive predictions

Predictor Table Hardware Budget

[Figure: percentage of false positive predictions vs. predictor table size (128B, 512B, 2KB, 8KB, 32KB, 128KB) for the Lai et al. predictor and the skewed predictor; y-axis 3 to 5]

With an 8KB predictor, the VVC achieves a 5.4% speedup with the original predictor, versus 12.1% with the skewed predictor

Cache Efficiency

[Figure: L2 cache efficiency for the baseline 2MB LRU cache and the VVC; y-axis 0 to 0.8]

VVC improves cache efficiency by 62% for multi-threaded workloads and by 26% for single-threaded workloads

Outline: Introduction, Background, Virtual Victim Cache, Methodology, Results, Conclusion

Experimental Methodology

Dead block predictor parameters:

Trace encoding: 16
Predictor table entries: 32768
Predictor entry: 2 bits
Predictor overhead: 8KB
Cache overhead: 64KB
Total overhead: 76KB

Overhead is 3.4% of the total 2MB L2 cache space

Reducing Dead Blocks: Virtual Victim Cache

[Figure: cache, MRU to LRU]

Dead blocks all over the cache act as a victim cache.

Virtual Victim Cache: place evicted blocks in dead blocks of an adjacent set; on a miss, search the adjacent set for a match; if the block is found in the adjacent set, bring it back to its original set.

Virtual Victim Cache: How it Works

How to determine the adjacent set? A set that differs by only one bit (in our case the 4th bit), far enough away not to be a hot set.

How to find a receiver block in the adjacent set? Add one bit per block to mark receiver blocks.

Where to place the receiver block? Use the dynamic insertion policy to choose either the LRU or MRU position.