Using Dead Blocks as a Virtual Victim Cache
Samira Khan, Daniel A. Jiménez, Doug Burger, Babak Falsafi
The Cache Utilization Wall
- Performance gap: processors are getting faster, while memory is only getting larger
- Caches are not efficient: designed for fast lookup, they contain too many useless blocks!
- We want the cache to be as efficient as possible
Cache Problem: Dead Blocks
- Live block: will be referenced again before eviction
- Dead block: dead from the last reference until evicted
[Diagram: block lifetime in a cache set, MRU to LRU; fill, hit, hit, hit, last hit, eviction; live until the last hit, dead afterward]
Cache blocks are dead on average 59% of the time
Reducing Dead Blocks: Virtual Victim Cache
[Diagram: cache sets from MRU to LRU, with live, dead, and victim blocks]
- Dead blocks all over the cache act as a victim cache
- Put victim blocks in the dead blocks
Contributions:
- Virtual Victim Cache: victim placement and lookup
- Skewed dead block predictor
Results: improves predictor accuracy by 4.7%, reduces miss rate by 26%, improves performance by 12.1%
Virtual Victim Cache
Goal: use dead blocks to hold the victim blocks
Mechanisms required:
1. Identify which blocks are dead
2. Look up the victims
Different Dead Block Predictors
- Counting based [ICCD05]: predicts dead after a certain number of accesses
- Time based [ISCA02]: predicts dead after a certain number of cycles
- Trace based [ISCA01]: predicts the last touch based on the PC
- Cache burst based [MICRO08]: predicts dead when the block moves out of the MRU position
Trace-Based Dead Block Predictor [ISCA01]
- Predicts the last touch based on the sequence of instructions that touch the block
- Encoding: truncated addition of the instruction PCs, called the signature
- The predictor table is indexed by the signature; entries are 2-bit saturating counters
Trace-Based Dead Block Predictor [ISCA01] (example)
Instruction sequence, with accesses to block a marked:
  PC1: ld a  (fill)
  PC2: st b
  PC3: ld a  (hit)
  PC4: st a  (hit)
  PC5: ld a  (hit)
  PC6: ld e
  PC7: ld f
  PC8: st a  (hit, last touch)
signature = <PC1, PC3, PC4, PC5, PC8>
The block is live from the fill until the last hit at PC8, then dead until eviction. On eviction, the predictor table entry indexed by the signature (the PC sequence) is trained to 1 (dead).
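The signature and training scheme above can be sketched in Python. This is a minimal sketch, not the slides' hardware: the 15-bit signature width, the counter threshold, and the example PC values are all illustrative assumptions.

```python
# Minimal sketch of a trace-based dead block predictor.
# Assumptions (not from the slides): 15-bit signatures, a dead
# threshold of 2 on the 2-bit counter, made-up example PC values.

SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1
table = [0] * (1 << SIG_BITS)  # 2-bit saturating counters, 0 = live

def update_signature(sig, pc):
    # Truncated addition: add the PC, keep the low SIG_BITS bits.
    return (sig + pc) & MASK

def predict_dead(sig):
    # A reference whose signature has a high-enough counter is
    # predicted to be the block's last touch.
    return table[sig] >= 2

def train_on_eviction(last_sig):
    # The signature seen at the last touch led to a dead block:
    # strengthen its counter (saturating at 3).
    table[last_sig] = min(3, table[last_sig] + 1)

# Example: block filled at PC1, then hit at PC3, PC4, PC5, PC8.
sig = 0
for pc in (0x401000, 0x401010, 0x401018, 0x401020, 0x401038):
    sig = update_signature(sig, pc)
train_on_eviction(sig)
```

Each cache block would carry its running signature; on every hit the signature is extended, and the counter it finally indexes is consulted to predict whether that hit was the last.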
Skewed Trace Predictor
Reference trace predictor table:
  index = hash(signature)
  dead if confidence >= threshold
Skewed trace predictor tables:
  index1 = hash1(signature), index2 = hash2(signature)
  dead if conf1 + conf2 >= threshold
Skewed Trace Predictor
- Uses two different hash functions
- Reduces conflicts, improves accuracy
[Diagram: signatures sigX and sigY conflict in the first table (hash1(sigX) = hash1(sigY)) but map to different entries in the second table (hash2(sigX) != hash2(sigY))]
A conflict in both tables is less likely
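The two-table idea can be sketched as follows. This is a toy sketch: the hash functions, table sizes, and threshold are illustrative placeholders, not the paper's actual hashes.

```python
# Toy sketch of a skewed predictor: two tables indexed by different
# hashes of the same signature; the prediction sums both confidences.
# Hash functions, sizes, and the threshold are illustrative.

TABLE_BITS = 14
MASK = (1 << TABLE_BITS) - 1
table1 = [0] * (1 << TABLE_BITS)  # 2-bit saturating counters
table2 = [0] * (1 << TABLE_BITS)

def hash1(sig):
    return sig & MASK

def hash2(sig):
    # Mixes the bits differently, so two signatures that collide in
    # table1 are unlikely to also collide in table2.
    return ((sig >> 5) ^ (sig * 0x9E3779B1)) & MASK

def predict_dead(sig, threshold=4):
    return table1[hash1(sig)] + table2[hash2(sig)] >= threshold

def train(sig, dead):
    for table, h in ((table1, hash1), (table2, hash2)):
        i = h(sig)
        table[i] = min(3, table[i] + 1) if dead else max(0, table[i] - 1)
```

A signature that aliases with another in one table still gets an unpolluted confidence from the other table, so the summed vote is more robust than a single lookup.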
Victim Placement and Lookup in VVC
- Place victims in dead blocks of adjacent sets
- Any victim could be placed in any set, but then every set would have to be searched for a hit
- Trade-off between the number of sets and lookup latency
- We use only one adjacent set, to minimize lookup latency
How to determine the adjacent set?
- A set whose index differs by only one bit
- Far enough away not to be a hot set
[Diagram: original set and adjacent set in the cache, MRU to LRU]
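The one-bit-difference mapping can be sketched directly. The bit position follows a later backup slide (the 4th bit); the set count assumes the 2MB, 16-way, 64B-block L2 from the methodology, giving 2048 sets.

```python
# Adjacent-set mapping: flip one bit of the set index. The mapping is
# its own inverse, so each set pairs with exactly one partner set.
# ADJ_BIT = 4 follows the backup slide; NUM_SETS assumes the
# 2MB / 16-way / 64B L2 configuration.

NUM_SETS = 2 * 1024 * 1024 // (16 * 64)  # 2048 sets
ADJ_BIT = 4

def adjacent_set(set_index):
    return set_index ^ (1 << ADJ_BIT)
```

Because XOR with a fixed mask is an involution, adjacent_set(adjacent_set(s)) == s: the original set and its partner each act as the other's victim space.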
Victim Lookup
- On a miss, search the adjacent set
- If the block is found there, bring it back to its original set
[Diagram: lookup in the original set misses; lookup in the adjacent set hits; the block is moved to the original set]
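The lookup path can be sketched with dictionary-based sets. This is an illustrative model, not the hardware tag-match logic; the bit-4 adjacency is assumed from the backup slide.

```python
# Sketch of the VVC lookup flow: hit in the original set if possible,
# else probe the single adjacent set (index differs in bit 4); on a hit
# there, move the block back to its original set. Per-set dicts mapping
# tag -> data stand in for the tag and data arrays.

ADJ_BIT = 4

def vvc_lookup(cache, set_index, tag):
    """cache: list of per-set dicts mapping tag -> data."""
    original = cache[set_index]
    if tag in original:                    # normal hit
        return original[tag]
    adjacent = cache[set_index ^ (1 << ADJ_BIT)]
    if tag in adjacent:                    # hit on a victim block
        original[tag] = adjacent.pop(tag)  # refill into the original set
        return original[tag]
    return None                            # true miss: fetch from memory
```

On a victim hit the block migrates back to its original set, so repeated accesses pay the extra adjacent-set probe only once.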
Virtual Victim Cache: Why It Works
- Reduces conflict misses: provides extra associativity to the hot set
- Reduces capacity misses: puts the LRU block (which a fully associative cache would have replaced) into a dead block; increasing the number of live blocks effectively increases capacity
- Robust to false positive predictions: the VVC finds the prematurely evicted block in the adjacent set, avoiding the miss
Experimental Methodology

Parameter                Configuration
Issue width              4
L1 I cache               64KB, 2-way LRU, 64B blocks, 1 cycle hit
L1 D cache               64KB, 2-way LRU, 64B blocks, 3 cycle hit
L2 cache                 2MB, 16-way LRU, 64B blocks, 12 cycle hit
Main memory              270 cycles
Cores                    4
Trace encoding           15 bits
Predictor table entries  32768
Predictor entry          2 bits

Simulator: modified version of SimpleScalar
Benchmarks: SPEC CPU2000 and SPEC CPU2006
Single Thread Speedup
[Chart: speedup (y-axis 0.95 to 1.2) per benchmark for a fully associative LRU cache and baseline + 64KB victim cache; clipped bars annotated 2.6, 1.3, 1.7, 1.7, 1.3, 0.9]
A fully associative cache and a 64KB victim cache are both unrealistic designs
Single Thread Speedup
[Chart: speedup (y-axis 0.9 to 1.2) per benchmark, including the dynamic insertion policy; clipped bars annotated 1.2, 1.6, 2.6, 1.4, 1.7]
The accuracy of the predictor is more important in dead block replacement
Speedup for Multiple Threads
[Chart: speedup (y-axis 0.9 to 1.3) for baseline + 64KB victim cache, dynamic insertion policy, and dead block replacement; several bars fall below 1.0 (0.88, 0.88, 0.89, 0.84)]
Blocks become less predictable in the presence of multiple threads
Tag Array Reads due to VVC
[Chart: tag array reads (tens of millions, y-axis 0 to 45) per benchmark (175.vpr, 178.galgel, 179.art, 181.mcf, 187.facerec, 188.ammp, 197.parser, 255.vortex, 256.bzip2, 300.twolf, 401.bzip2, 429.mcf, 450.soplex, 456.hmmer, 464.h264ref, 473.astar, and the arithmetic mean) against the baseline 16-way 2MB LRU cache]
Tag array reads in the baseline cache are 3.9% of the total number of instructions executed, versus 4.9% for the VVC
Conclusion
- The skewed predictor improves accuracy by 4.7%
- The Virtual Victim Cache achieves a 12.1% speedup for single-threaded workloads and a 4% speedup for multi-threaded workloads
- Future work in dead block prediction: improve accuracy, reduce overhead
Dead Blocks as a Virtual Victim Cache
- Victim blocks are placed into the adjacent set
- Evicted blocks are placed in an invalid or predicted-dead block of the adjacent set
- If no such block is present, victim blocks are placed in the LRU block
- The receiver block is then moved to the MRU position; adaptive insertion is also used
Cache lookup for a previously evicted block:
- Original set lookup: miss
- Adjacent set lookup: hit
- The block is refilled from the adjacent set to the original set
- The receiver block in the adjacent set is marked as invalid
- One bit per block keeps track of receiver blocks
- Tag matching on original accesses ignores the receiver blocks
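The placement rules on these backup slides can be sketched like this. The Block fields are illustrative, and the always-MRU insertion is a simplification: the slides' adaptive choice between the LRU and MRU positions is omitted.

```python
# Sketch of victim placement in the adjacent set: prefer an invalid or
# predicted-dead block, otherwise take the LRU block; mark the receiver
# with its one tracking bit and move it to the MRU position. Block
# fields are illustrative; adaptive LRU/MRU insertion is omitted.

from dataclasses import dataclass

@dataclass
class Block:
    tag: int = -1
    valid: bool = False
    predicted_dead: bool = False
    is_receiver: bool = False  # the one bit that tracks receiver blocks

def place_victim(adj_set, victim_tag):
    """adj_set: list of Blocks ordered MRU (front) to LRU (back)."""
    # 1. Prefer an invalid or predicted-dead block;
    # 2. otherwise fall back to the LRU block.
    target = next((b for b in adj_set if not b.valid or b.predicted_dead),
                  adj_set[-1])
    target.tag = victim_tag
    target.valid = True
    target.predicted_dead = False
    target.is_receiver = True
    # 3. Move the receiver block to the MRU position.
    adj_set.remove(target)
    adj_set.insert(0, target)
    return target
```

Because original-set tag matches ignore blocks whose receiver bit is set, a receiver never aliases with the adjacent set's own resident blocks.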
Reduction in Cache Area
[Chart: harmonic mean IPC (y-axis 0.3 to 0.8) versus cache capacity for the baseline cache and the Virtual Victim Cache]
Predictor Coverage and False Positive Rate
[Chart: percentage of L2 accesses (y-axis 0 to 60) per benchmark (175.vpr through 473.astar, plus the arithmetic mean), showing skewed predictor coverage and skewed predictor false positive rate]
Trace-Based Dead Block Predictor (operation)
Memory instruction sequence going to cache set s: pc m: ld a, pc n: ld a, pc o: st a, pc p: ld b, pc q: ld c, pc r: st d, pc s: ld e, pc t: ld f, pc u: ld g, pc v: ld h, pc w: ld i
- Fill action (pc m: ld a): block a is filled; its signature is set to <m>; the predictor entry for <m> reads 0 (live)
- Hit actions (pc n: ld a, pc o: st a): the signature stored with the block is updated to <m+n>, then <m+n+o>
- Evict action: the predictor entry indexed by the final signature <m+n+o> is updated to 1 (dead)
Each cache block stores its current signature alongside the tag and data.
MPKI
[Chart: L2 misses per 1000 instructions (y-axis 0 to 60) per benchmark (175.vpr through 473.astar, plus the arithmetic mean) for: baseline 16-way 2MB LRU cache, fully associative LRU cache, baseline + 64KB victim cache, dynamic insertion policy, dead block replacement, virtual victim cache, and fully associative cache with optimal replacement]
IPC
[Chart: IPC (y-axis 0 to 3) per benchmark (175.vpr through 473.astar) for the baseline 16-way 2MB LRU cache, fully associative LRU cache, and baseline + 64KB victim cache]
Speedup
[Chart: speedup (y-axis 0.9 to 1.7) for fully associative LRU cache, baseline + 64KB victim cache, dynamic insertion policy, dead block replacement, and virtual victim cache; clipped bars annotated 2.5, 2.6, 2.6]
False Positive Prediction
[Chart: percentage of false positive predictions (y-axis 0 to 25) per benchmark]
Shared cache contention results in more false positive predictions
Predictor Table Hardware Budget
[Chart: percentage of false positive predictions (y-axis 3 to 5) versus predictor table size (128B to 128KB) for the Lai et al. predictor and the skewed predictor]
With an 8KB predictor table, the VVC achieves a 5.4% speedup with the original predictor versus 12.1% with the skewed predictor
Cache Efficiency
[Chart: L2 cache efficiency (y-axis 0 to 0.8) against the baseline 2MB LRU cache]
VVC improves cache efficiency by 62% for multi-threaded workloads and by 26% for single-threaded workloads
Experimental Methodology

Dead block predictor parameters:
Parameter                Configuration
Trace encoding           16 bits
Predictor table entries  32768
Predictor entry          2 bits
Predictor overhead       8KB
Cache overhead           64KB
Total overhead           76KB

The overhead is 3.4% of the total 2MB L2 cache space
Reducing Dead Blocks: Virtual Victim Cache
[Diagram: cache sets from MRU to LRU]
- Dead blocks all over the cache act as a victim cache
- Place evicted blocks in dead blocks of the adjacent set
- On a miss, search the adjacent set for a match
- If the block is found in the adjacent set, bring it back to its original set
Virtual Victim Cache: How It Works
- How to determine the adjacent set? A set whose index differs by only one bit (in our case the 4th bit), far enough away not to be a hot set
- How to find the receiver block in the adjacent set? One bit is added to mark receiver blocks
- Where to place the receiver block? A dynamic insertion policy chooses either the LRU or the MRU position