Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
The V-Way Cache:
Demand-Based Associativity via Global Replacement
Moinuddin K. Qureshi David Thompson Yale N. Patt
Department of Electrical and Computer Engineering,
The University of Texas at Austin.
{moin, dave, patt}@hps.utexas.edu
The International Symposium on Computer Architecture - 2005 1/23
Introduction
Need for efficient management of secondary caches.
Ideal cache: fully associative with OPT replacement.
The International Symposium on Computer Architecture - 2005 2/23
Introduction
Need for efficient management of secondary caches.
Ideal cache: fully associative with OPT replacement.
0
10
20
30
40
50
Per
cent
age
redu
ctio
n in
mis
s ra
te 256kB 16-way + LRU256kB FA + LRU256kB FA + OPT512kB 8-way + LRU
The International Symposium on Computer Architecture - 2005 2/23
Fully Associative Caches: Cost v/s Benefit
Benefits
Conflict miss eliminationGlobal Replacement (finds the best victim)
Cost
Significant increase in the number of tag comparisonsIncreased access latencyIncreased power consumption
The International Symposium on Computer Architecture - 2005 3/23
Fully Associative Caches: Cost v/s Benefit
Benefits
Conflict miss eliminationGlobal Replacement (finds the best victim)
Cost
Significant increase in the number of tag comparisonsIncreased access latencyIncreased power consumption
Can we get the benefits of a fully associative cache without payingthe cost?
The International Symposium on Computer Architecture - 2005 3/23
Outline
Introduction
Example of Local and Global Replacement
The V-Way Cache
Evaluation
Related Work and Conclusion
The International Symposium on Computer Architecture - 2005 4/23
Example of Local Replacement
X = {x0, x1, x2, x3}
Y = {y0, y1, y2, y3}
SET A
SET B
x0 x1 x2 x3
y0 y1 y2 y3
WORKING SETdx0
dx1
dx2
dx3
dy0
dy1
dy2
dy3
DATA−STORETAG−STOREADDRESS
The International Symposium on Computer Architecture - 2005 5/23
Example of Local Replacement
SET A
SET B
x0 x1 x2 x3
y0 y1 y2 y3
WORKING SETdx0
dx1
dx2
dx3
dy0
dy1
dy2
dy3
DATA−STORETAG−STOREADDRESS
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
The International Symposium on Computer Architecture - 2005 5/23
Example of Local Replacement
SET A
SET B
x1 x2 x3
y0 y1 y2 y3
WORKING SET
dx1
dx2
dx3
dy0
dy1
dy2
dy3
DATA−STORETAG−STOREADDRESS
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
x4
dx4
The International Symposium on Computer Architecture - 2005 5/23
Example of Local Replacement
SET A
SET B
x1 x2 x3
y0 y1 y2 y3
WORKING SET
dx1
dx2
dx3
dy0
dy1
dy2
dy3
DATA−STORETAG−STOREADDRESS
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
x4
dx4
DORMANT WAY
THRASH
The International Symposium on Computer Architecture - 2005 5/23
Example of Local Replacement
SET A
SET B
x1 x2 x3
y0 y1 y2 y3
WORKING SET
dx1
dx2
dx3
dy0
dy1
dy2
dy3
DATA−STORETAG−STOREADDRESS
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
x4
dx4
DORMANT WAY
THRASH
Static partitioning of resources.
The International Symposium on Computer Architecture - 2005 5/23
Example of Global Replacement
Y = {y0, y1, y2, y3}
X = {x0, x1, x2, x3}
SET B1
SET A1
SET B0
SET A0 x0 x2
x1 x3
y0 y2
y1 y3
dy0
dx3
dx2
dy3
dy1
dx1
dy2
ADDRESS
dx0
DATA−STORETAG−STORE
WORKING SET
REDISTRIBUTED
The International Symposium on Computer Architecture - 2005 6/23
Example of Global Replacement
SET B1
SET A1
SET B0
SET A0 x0 x2
x1 x3
y0 y2
y1 y3
dy0
dx3
dx2
dy3
dy1
dx1
dy2
ADDRESS
dx0
DATA−STORETAG−STORE
WORKING SET
REDISTRIBUTED
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
The International Symposium on Computer Architecture - 2005 6/23
Example of Global Replacement
SET B1
SET A1
SET B0
SET A0 x0 x2
x1 x3
y0 y2
y1 y3
dy0
dx3
dx2
dy3
dy1
dx1
dy2
ADDRESS
dx0
DATA−STORETAG−STORE
WORKING SET
REDISTRIBUTED
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
The International Symposium on Computer Architecture - 2005 6/23
Example of Global Replacement
SET B1
SET A1
SET B0
SET A0 x0 x2
x1 x3
y0 y2
y1
dy0
dx3
dx2
dy1
dx1
dy2
ADDRESS
dx0
DATA−STORETAG−STORE
WORKING SET
REDISTRIBUTED
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
x4
dx4
The International Symposium on Computer Architecture - 2005 6/23
Example of Global Replacement
SET B1
SET A1
SET B0
SET A0 x0 x2
x1 x3
y0 y2
y1
dy0
dx3
dx2
dy1
dx1
dy2
ADDRESS
dx0
DATA−STORETAG−STORE
WORKING SET
REDISTRIBUTED
X = {x0, x1, x2, x3, x4}
Y = {y0, y1, y2}
x4
dx4
Dynamic sharing of resources!!
The International Symposium on Computer Architecture - 2005 6/23
Outline
Introduction
Example of Local and Global Replacement
The V-Way Cache
Evaluation
Related Work and Conclusion
The International Symposium on Computer Architecture - 2005 7/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
HIT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
HIT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBAL
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
DATA
HIT
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
MISS
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
MISS
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
MISS
Configuration Tag Access Data Replacement
Set-Associative Fast Local
Fully-Associative
V-Way
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
MISS
Configuration Tag Access Data Replacement
Set-Associative Fast Local
Fully-Associative Slow Global
V-Way
The International Symposium on Computer Architecture - 2005 8/23
The V-Way Cache
TAGCOMPARE
FPTRSELECT
RPTRV
ARRAY
SCHEME
GLOBALDATA
DATA
FPTRTAGSTATUS
REPLACEMENT
INDEX OFFTAG
TAG−STORE DATA−STORE
MISS
Configuration Tag Access Data Replacement
Set-Associative Fast Local
Fully-Associative Slow Global
V-Way Fast Global
The International Symposium on Computer Architecture - 2005 8/23
A Practical Global Replacement Algorithm
LRU is impractical because there are thousands of lines
Second level cache access stream is a filtered version of theprogram access stream
Reuse frequency is skewed towards the low end0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
15+
Reuse count
0
10
20
30
40
Per
cent
age
of L
2 ev
icti
ons
The International Symposium on Computer Architecture - 2005 9/23
Reuse Replacement
00 01 10 11
VICTIMIZE
INITIALIZE
TESTTESTTESTTEST
TABLE
PTR
REUSE COUNTER
ACCESS ACCESS ACCESS
ACCESS
The International Symposium on Computer Architecture - 2005 10/23
Reuse Replacement
00 01 10 11
VICTIMIZE
INITIALIZE
TESTTESTTESTTEST
ACCESS ACCESS ACCESS
ACCESS
TABLE
PTR
REUSE COUNTER
11
01
00
The International Symposium on Computer Architecture - 2005 10/23
Reuse Replacement
00 01 10 11
VICTIMIZE
INITIALIZE
TESTTESTTESTTEST
TABLEREUSE COUNTER
PTR
10
01
00ACCESS ACCESS ACCESS
ACCESS
The International Symposium on Computer Architecture - 2005 10/23
Reuse Replacement
00 01 10 11
VICTIMIZE
INITIALIZE
TESTTESTTESTTEST
ACCESS ACCESS
ACCESS
TABLEREUSE COUNTER
PTR
10
00
00ACCESS
The International Symposium on Computer Architecture - 2005 10/23
Reuse Replacement
00 01 10 11
VICTIMIZE
INITIALIZE
TESTTESTTESTTEST
ACCESS
TABLEREUSE COUNTER
PTR
10
00
00ACCESS ACCESS
ACCESS
The International Symposium on Computer Architecture - 2005 10/23
Victim Distance for Reuse Replacement
Problem of variable replacement latencyAverage victim distance: 3.9Worst case victim distance: 1888
The International Symposium on Computer Architecture - 2005 11/23
Victim Distance for Reuse Replacement
Problem of variable replacement latencyAverage victim distance: 3.9Worst case victim distance: 1888
SolutionTest eight counters each cycleLimit search to five cycles
1 2 3 4 5
98.9%Probability (victim)
Latency (in cycles)
91.3% 96.9% 99.2%98.3%
The International Symposium on Computer Architecture - 2005 11/23
Outline
Introduction
Example of Local and Global Replacement
The V-Way Cache
Evaluation
Related Work and Conclusion
The International Symposium on Computer Architecture - 2005 12/23
Evaluation Outline
Experimental Methodology
Reduction in Misses with the V-Way Cache
Comparing Reuse Replacement and LRU
Storage, Latency, and Energy Cost
Impact on IPC
The International Symposium on Computer Architecture - 2005 13/23
Experimental Methodology
First level I-cache, D-cache: 16kB, 2-way, 64B linesize, LRU
Baseline L2: Unified, 256kB, 8-way, 128B linesize, LRU
Benchmarks: SPEC CPU2000
The International Symposium on Computer Architecture - 2005 14/23
Reduction in Misses with the V-Way Cache
Primary upper bound: Fully associative cache
Secondary upper bound: Double sized cache
The International Symposium on Computer Architecture - 2005 15/23
Reduction in Misses with the V-Way Cache
Primary upper bound: Fully associative cache
Secondary upper bound: Double sized cache
0
10
20
30
40
50
60
70
80
90
100
Per
cent
age
redu
ctio
n in
mis
s ra
te 256kB V-way + Reuse
256kB FA + LRU512kB 8-way + LRU
perlb
mkcra
ftymes
afac
erec
vorte
xap
simcf vp
r
gcc
ammp
twolf
bzip2
parse
rgz
ipsw
imga
lgel
amea
n
The International Symposium on Computer Architecture - 2005 15/23
Comparing Reuse Replacement and LRU
Comparison of miss-rate for LRU and Reuse replacement
Bmk bzip2 crafty gcc gzip mcf parser perl. twolf vortex
LRU 34.6 1.1 3.8 2.4 29.5 32.7 0.1 36.5 8.5
Reuse 35.0 1.0 3.8 2.4 29.9 32.9 0.1 35.4 7.1
Bmk vpr ammp apsi facerec galgel mesa swim amean
LRU 11.0 50.0 34.8 50.7 8.3 3.4 65.3 23.3
Reuse 10.5 50.0 34.8 50.6 8.5 3.5 65.3 23.2
The International Symposium on Computer Architecture - 2005 16/23
Storage, Latency, and Energy Cost
Storage needed for extra tags, FPTR, RPTR, and Reuse bits
Line-size Miss-rate reduction Increase in area
128 B 13.2% 5.8%
256 B 14.9% 2.9%
Delay due to more tags and FPTR selection: 0.13 ns
Energy in accessing bigger tag-store
Parallel lookup Baseline V-Way
1.02nJ 0.35nJ 0.40nJ
The International Symposium on Computer Architecture - 2005 17/23
Storage, Latency, and Energy Cost
Storage needed for extra tags, FPTR, RPTR, and Reuse bits
Line-size Miss-rate reduction Increase in area
128 B 13.2% 5.8%
256 B 14.9% 2.9%
Delay due to more tags and FPTR selection: 0.13 ns
Energy in accessing bigger tag-store
Parallel lookup Baseline V-Way
1.02nJ 0.35nJ 0.40nJ
The International Symposium on Computer Architecture - 2005 17/23
Storage, Latency, and Energy Cost
Storage needed for extra tags, FPTR, RPTR, and Reuse bits
Line-size Miss-rate reduction Increase in area
128 B 13.2% 5.8%
256 B 14.9% 2.9%
Delay due to more tags and FPTR selection: 0.13 ns
Energy in accessing bigger tag-store
Parallel lookup Baseline V-Way
1.02nJ 0.35nJ 0.40nJ
The International Symposium on Computer Architecture - 2005 17/23
Impact on IPC
Pipeline: 12 stage, 8 wide with 128 entry reservation station
L1 hit latency of 2 cycles and L2 hit latency of 10 cycles
L3/Main memory: access-latency of 80 cycles
The International Symposium on Computer Architecture - 2005 18/23
Impact on IPC
Pipeline: 12 stage, 8 wide with 128 entry reservation station
L1 hit latency of 2 cycles and L2 hit latency of 10 cycles
L3/Main memory: access-latency of 80 cycles
0
5
10
15
20
25
30
35
40
45
Per
cent
age
IPC
impr
ovem
ent
over
bas
e
V-Way (same hit latency as base)V-Way (additional one cycle hit latency)
bzip2
crafty gcc
gzip
mcfpa
rser
perlb
mktw
olfvo
rtex
vpr
ammp
apsi
facere
cga
lgel
mesa
swim
gmea
n
The International Symposium on Computer Architecture - 2005 18/23
Outline
Introduction
Example of Local and Global Replacement
The V-Way Cache
Evaluation
Related Work and Conclusion
The International Symposium on Computer Architecture - 2005 19/23
Related Work
Extra storage for conflict misses: Victim cache [Jouppi ISCA’90]
Multi-probe techniquesPredictive sequential associative cache [Calder+ HPCA’96]
Adaptive group associative cache [Peir+ ASPLOS’98]
Cache indexing functionSkewed associativity [Seznec ISCA’93]
Prime-modulo indexing [Kharbutli+ HPCA’04]
Software managed fully associative cache: IIC [Hallnor+ ISCA’00]
The International Symposium on Computer Architecture - 2005 20/23
Other Possible Applications of the V-Way Cache
Platform for global replacement with inbuilt shadow directory
Tag inclusion data exclusion [Piranha ISCA’00]
Cache compression [Hallnor+ HPCA’05]
Interaction with NuRAPID [Chishti+ MICRO’03]
The International Symposium on Computer Architecture - 2005 21/23
Conclusion
Traditional cache assumes uniform accesses across sets
Global replacement allows the V-Way cache to vary thenumber of valid ways depending on the set demand
Reuse replacement is fast and performs comparable to LRU
V-Way cache can lower miss-rate and improve performance
V-Way cache can serve as an infrastructure for otheroptimizations
The International Symposium on Computer Architecture - 2005 22/23
Questions
The International Symposium on Computer Architecture - 2005 23/23