A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches
Xiaotong Zhuang, College of Computing
Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA 30332
ICPP, Kaohsiung, Taiwan, 2003
2ICPP-03
Agenda
  Introduction
  Motivation
  The Prefetch Pollution Filter
  Experimental Results
  Conclusion
Data Prefetching
Why data prefetching?
  Speed gap between CPU and main memory
  Initial data references still miss
  Performance suffers if there are not enough independent instructions to mask the latency
Prefetching techniques
  Hardware-based
  Software-based
Design trend
  Memory bandwidth increases, allowing more aggressive prefetching
  L1 caches are getting smaller to expedite accesses
When prefetching becomes "too aggressive"
  Severe pollution
  Performance overkill
Cache Pollution
Source of pollution
  No prefetching scheme guarantees 100% accuracy
  HW-based prefetching can cause a lot of pollution
  Stride-based prefetching can easily become ineffective for pointer-based applications
Outcomes of pollution
  Evict useful data
  Compete for available resources
    Limited cache capacity
    Cache ports
    Bus bandwidth between components of the memory hierarchy
  Degrade performance
Related Work
Prefetch buffer [Chen et al. ’91] [Chen & Baer ’95]
  Separate normal and prefetched data, access in parallel
  Small, fully associative, in the critical path
Evict-me [Wang et al. ’02]
  Reuse distance check: mark data that is unused or whose reuse distance is too long
  Evict-me data have higher priority to be cast out
Dead cache line detection [Lai, Fide & Falsafi ’01]
  Detect dead blocks and replace them with useful prefetches
  Prevent useful data from being evicted
Prefetch taxonomy [Srinivasan et al. ’99]
  More detailed classification of prefetches
  Proposed a "static filter": profiling-based pollution filtering
Our Contribution
Characterization of prefetch effectiveness
Propose and evaluate two hardware prefetch pollution filtering mechanisms
  Per-Address (PA) based
  Program Counter (PC) based
Quantify our technique through simulation
Agenda: Motivation
Prefetch Classification
Prefetch classification
  Comprehensive classification is not desirable due to its implementation complexity in hardware
  Good or effective: those referenced in the cache before they are evicted
  Bad or ineffective: those never referenced during their lifetime in the cache
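The two-way classification above can be read as a rule applied at the moment a prefetched line leaves the cache. The following is a minimal Python sketch of that rule, not the authors' implementation; the `PrefetchedLine` record, its `referenced` flag, and the helper names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefetchedLine:
    """Hypothetical record for one prefetched cache line (illustrative only)."""
    address: int
    referenced: bool = False  # set when a demand access hits the line


def classify_on_eviction(line: PrefetchedLine) -> str:
    """Apply the good/bad rule when the line is evicted."""
    # Good: referenced at least once before eviction. Bad: never referenced.
    return "good" if line.referenced else "bad"


def bad_fraction(evicted: list[PrefetchedLine]) -> float:
    """Fraction of evicted prefetched lines that were never referenced."""
    if not evicted:
        return 0.0
    bad = sum(1 for line in evicted if classify_on_eviction(line) == "bad")
    return bad / len(evicted)


# Four made-up evictions: two referenced, two never touched.
lines = [
    PrefetchedLine(0x1000, referenced=True),
    PrefetchedLine(0x1020),
    PrefetchedLine(0x1040),
    PrefetchedLine(0x1060, referenced=True),
]
print(bad_fraction(lines))  # 0.5
```

The addresses and outcomes in the usage lines are invented test data, not measurements from the talk.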
Prefetch Effectiveness
11 benchmarks; HW prefetch (NSP, SDP) and SW prefetch
More than 52% of prefetches are bad!
[Figure: normalized number of prefetches, split into good prefetches and bad prefetches; y-axis from 0 to 1]
Agenda: The Prefetch Pollution Filter
Cache Pollution Filter
[Figure: block diagram. Components: OOO core, LD/ST queue, L1 cache, L2 cache, hardware prefetcher, prefetch queue, prefetch pollution filter. Arrow labels: "Issue prefetch", "SW prefetches", "lookup", "Hash", "Update", "Ld/st inst includ. SW prefetches".]
Prefetch Pollution Filter
  History table: array of 2-bit counters, accessed through a hash
Cache line fields
  TAG
  Reference Indication Bit (RIB)
  Prefetch Indication Bit (PIB)
  DATA
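The slide names the building blocks (a history table of 2-bit counters, a hash, lookup and update paths, and the PIB and RIB bits) but not the exact policy. The Python sketch below shows one plausible way they could fit together. Everything beyond those named pieces is an assumption made for illustration: the modulo hash, the initial counter value of 2, the "issue when counter >= 2" threshold, and the `key` field stored with each line.

```python
from dataclasses import dataclass


class PollutionFilter:
    """History table of 2-bit saturating counters (sketch, not the authors' design).

    Assumed policy: counters start at 2, a prefetch is issued when its
    counter is >= 2, and the hash is a plain modulo.
    """

    def __init__(self, entries: int = 4096, initial: int = 2) -> None:
        self.entries = entries
        self.table = [initial] * entries

    def _index(self, key: int) -> int:
        return key % self.entries  # placeholder hash (assumption)

    def allow(self, key: int) -> bool:
        """Lookup path: should a prefetch with this key be issued?"""
        return self.table[self._index(key)] >= 2

    def update(self, key: int, referenced: bool) -> None:
        """Update path: train the counter with a prefetch outcome."""
        i = self._index(key)
        if referenced:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0


@dataclass
class CacheLine:
    tag: int
    key: int           # hypothetical: filter key remembered with the line
    pib: bool = False  # Prefetch Indication Bit: line was filled by a prefetch
    rib: bool = False  # Reference Indication Bit: line was referenced


def on_evict(line: CacheLine, filt: PollutionFilter) -> None:
    """Only prefetched lines (PIB set) train the filter; RIB gives the outcome."""
    if line.pib:
        filt.update(line.key, line.rib)


filt = PollutionFilter()
key = 0x40
print(filt.allow(key))  # True: counter starts at 2
on_evict(CacheLine(tag=1, key=key, pib=True), filt)  # never referenced
print(filt.allow(key))  # False: counter dropped to 1
on_evict(CacheLine(tag=1, key=key, pib=True, rib=True), filt)  # referenced
print(filt.allow(key))  # True: counter back to 2
```

The sketch leaves out how a key whose prefetches are being filtered would ever be retrained, since the slides do not say.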
Prefetch Pollution Filters
PA-based
  Per-Address-based: tracks the cache line addresses issued by each prefetch operation
  Can distinguish different prefetch addresses issued by the same instruction
  Needs a larger history table to reduce aliasing
PC-based
  Tracks the program counter that triggers a prefetch
  SW prefetch: PC of the prefetch instruction
  HW prefetch: PC of the memory instruction that triggers the prefetch
  Less aliasing, tolerates a smaller history table, less precise
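The difference between the two schemes comes down to which value indexes the history table. The Python sketch below illustrates this under stated assumptions: the 32-byte line size is taken from the default L1 configuration, while the modulo hash, the 4096-entry default, and the function names are hypothetical choices made for the example.

```python
LINE_BYTES = 32  # L1 line size from the default simulation configuration


def pa_key(prefetch_address: int) -> int:
    """PA-based key: the cache line address being prefetched."""
    return prefetch_address // LINE_BYTES


def pc_key(trigger_pc: int) -> int:
    """PC-based key: PC of the SW prefetch instruction, or of the memory
    instruction that triggered the HW prefetch."""
    return trigger_pc


def table_index(key: int, entries: int = 4096) -> int:
    """Placeholder hash into the history table (assumption: plain modulo)."""
    return key % entries


# One instruction (made-up PC) triggers prefetches to three made-up lines.
trigger_pc = 0x400100
addresses = [0x8000, 0x8020, 0x8040]

pa_entries = {table_index(pa_key(a)) for a in addresses}
pc_entries = {table_index(pc_key(trigger_pc)) for _ in addresses}

print(len(pa_entries))  # 3: PA keeps a separate counter per prefetched line
print(len(pc_entries))  # 1: PC shares one counter for the whole instruction
```

This matches the trade-off on the slide: PA-based filtering separates addresses issued by one instruction at the cost of more table entries, while PC-based filtering folds them into a single, less precise entry.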
Agenda: Experimental Results
Simulation Configuration (Default)
Processor
  Target frequency: 2 GHz
  Issue/retire width: 8 per cycle
  Reorder buffer: 128 entries
  Load/store queue: 64 entries
  Branch predictor: bimodal with 2048 entries
  BTB size: 4096 sets, assoc=4
Memory
  Latency: 150 core cycles
  Bus: 64 bytes wide
Caches
  L1 I/D: 8K, 32-byte line, DM, 1 cycle
  L1 D ports: 3
  L2 I/D: 512K, 32-byte line, 4-way, 15-cycle delay
  L2 I/D ports: 1
Prefetcher
  Queue length: 64 entries
Pollution filter
  History table: 1KB, 4K entries
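The "1KB, 4K entries" history table size follows directly from the 2-bit counters described for the filter. The short Python check below works through the arithmetic and also lists the storage for the 1K to 16K entry sizes used in the history table sweep; the function name `table_bytes` is hypothetical.

```python
BITS_PER_COUNTER = 2  # the history table is an array of 2-bit counters


def table_bytes(entries: int) -> int:
    """Storage in bytes for a history table with the given entry count."""
    return entries * BITS_PER_COUNTER // 8


print(table_bytes(4 * 1024))  # 1024 bytes, i.e. 1KB for the default 4K entries

for k in (1, 2, 4, 8, 16):
    print(f"{k}K entries -> {table_bytes(k * 1024)} bytes")
```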
Benchmarks and Miss Rates
Benchmark   Input data set                   L1 miss rate   L2 miss rate
bh          2048 bodies                      0.0464         0.0026
em3d        100 nodes, 10 arity, 10K iter    0.2161         0.0001
perimeter   12 levels                        0.0478         0.2709
ijpeg       penguin.ppm                      0.0565         0.0235
fpppp       natoms.in                        0.0807         0.0003
gcc         cp-decl.i                        0.0551         0.0221
wave5       wave5.in                         0.1387         0.0209
gap         ref.in                           0.0409         0.2247
gzip        input.graphic                    0.0597         0.3176
mcf         inp.in                           0.0648         0.2426
Prefetch Reduction Comparison (Default Model)
Normalized to the number of good prefetches without filtering
Loss of bad prefetches: 97% (PA), 98% (PC)
Loss of good prefetches: 51% (PA), 48% (PC)
Traffic reduction: 75% (PA), 74% (PC)
[Figure: normalized number of prefetches; series: bad (no filtering), bad (PA), bad (PC), good (no filtering), good (PA), good (PC); y-axis from 0 to 3.5]
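The three percentages on this slide can be cross-checked against each other. The Python sketch below assumes that "traffic reduction" means the reduction in the total number of prefetches issued (an interpretation, not something the slide states) and solves for the bad-to-good prefetch ratio that would make the numbers consistent. With good prefetches normalized to 1 and a ratio r of bad to good, the removed share is (loss_bad * r + loss_good) / (r + 1).

```python
def implied_bad_to_good_ratio(loss_bad: float, loss_good: float,
                              traffic_reduction: float) -> float:
    """Solve (loss_bad*r + loss_good) / (r + 1) = traffic_reduction for r.

    Assumes traffic reduction counts removed prefetches over all prefetches.
    """
    return (traffic_reduction - loss_good) / (loss_bad - traffic_reduction)


def bad_share(ratio: float) -> float:
    """Convert a bad/good ratio into the bad share of all prefetches."""
    return ratio / (1.0 + ratio)


r_pa = implied_bad_to_good_ratio(0.97, 0.51, 0.75)
r_pc = implied_bad_to_good_ratio(0.98, 0.48, 0.74)

print(round(r_pa, 2), round(bad_share(r_pa), 2))  # 1.09 0.52
print(round(r_pc, 2), round(bad_share(r_pc), 2))  # 1.08 0.52
```

Under that assumption, both filters imply that roughly 52% of unfiltered prefetches are bad, which is in line with the share reported on the Prefetch Effectiveness slide. If traffic is measured on a different basis, this check does not apply.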
IPC Comparison (Default Model)
IPC increase: 8.2% (PA), 9.1% (PC)
[Figure: IPC; series: no filtering, PA-based, PC-based; y-axis from 0 to 3.5]
Prefetch Reduction Comparison (32KB)
Loss of bad prefetches: 91% (PA), 92% (PC)
Loss of good prefetches: 35% (PA), 27% (PC)
Traffic reduction: 52% (PA), 47% (PC)
[Figure: normalized number of prefetches; series: bad (no filtering), bad (PA), bad (PC), good (no filtering), good (PA), good (PC); y-axis from 0 to 1.6]
IPC Comparison (32K Cache Model)
IPC increase: 7.0% (PA), 8.1% (PC)
[Figure: IPC; series: no filtering, PA-based, PC-based; y-axis from 0 to 4]
IPC for Different History Table Sizes
IPC jumps by 6% between 2K and 4K entries; it changes by less than 1% before and after
[Figure: IPC; series: 1K, 2K, 4K, 8K, 16K entries; y-axis from 0 to 3.5]
Bad/Good Prefetch Ratio for Different # of L1 Ports
6% drop from 3-port to 4-port, 2% drop from 4-port to 5-port
[Figure: bad/good prefetch ratio; series: 3-port, 4-port, 5-port; y-axis from 0 to 1]
IPC for Different # of L1 Ports
4% speedup from 3-port to 4-port, <1% speedup from 4-port to 5-port
[Figure: IPC; series: 3-port, 4-port, 5-port; y-axis from 0 to 3.5]
Bad/Good Prefetch Ratio w/ Prefetch Buffer
Prefetch buffer: on the critical path, very small
Prefetch buffer: no reduction in traffic, short lifetime for good prefetches
[Figure: bad/good prefetch ratio; series: PA-based (no prefbuf), PA-based (prefbuf), PC-based (no prefbuf), PC-based (prefbuf); y-axis from 0 to 1.6]
IPC Comparison w/ Prefetch Buffer
IPC loss: 9% (PA), 10% (PC)
[Figure: IPC; series: PA-based (no prefbuf), PA-based (prefbuf), PC-based (no prefbuf), PC-based (prefbuf); y-axis from 0 to 4]
Agenda: Conclusion
Conclusion
Overly aggressive prefetching is overkill
Lots of prefetches are ineffective
  Cannot remove SW-induced prefetches without source code
  Have to live with HW-induced prefetches
  Need dynamic HW-based prefetch filtering schemes
We propose (1) Per-Address-based and (2) Program-Counter-based filters that can
  Filter out ~98% of bad prefetches for 8KB L1
  Filter out ~92% of bad prefetches for 32KB L1
  Retain good prefetches: ~50% (8K L1), ~70% (32K L1)
Improvement
  Traffic reduced by ~75% (8K L1), ~50% (32K L1)
  Overall IPC improved by 7% to 9%
History table size can be reasonably small
Improvements decrease when more cache ports are added
IPC drops by 9-10% with a dedicated prefetch buffer for aggressive prefetching
That’s All Folks!
Thanks Archbeer!
Bad/Good Prefetch Ratio Comparison (Default Model)
Reduction: 70% (PA), 91% (PC)
[Figure: bad/good prefetch ratio; series: no filtering, PA-based, PC-based; y-axis from 0 to 3.5]
Bad/Good Prefetch Ratio Comparison (32KB)
Reduction: 75% (PA), 93% (PC)
[Figure: bad/good prefetch ratio; series: no filtering, PA-based, PC-based; y-axis from 0 to 1.6]