Not Quite My Tempo [1]
Matching Prefetches to Memory Access Times
Mark Sutherland, Ajaykumar Kannan, Natalie Enright Jerger
{suther68,kannanaj,enright}@ece.utoronto.ca
[1] This work was inspired by Chazelle, Simmons, and Teller’s foundational work in this area: Whiplash.
Goal: Add temporal information to an SMS prefetcher.
Motivation
[Figure: Useless Prefetches in SMS. y-axis: Fraction of Prefetches (0 to 1); x-axis: SPEC CPU2006 benchmarks 400.perlbench, 401.bzip2, 403.gcc, 410.bwaves, 416.gamess, 429.mcf, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 450.soplex, 453.povray, 454.calculix, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar.]
Our Contributions
1. Propose a framework for timely prefetching in SMS, which utilizes temporal information to filter prefetches in the Pattern History Table.
2. Demonstrate practical improvements over SMS with a 32kB storage budget.
SMS Prefetcher Background
Source: S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial Memory Streaming. SIGARCH Comput. Archit. News, 2006.
Representation of a Memory Region
Page Bits        Accessed Cache Blocks
1100110100 …     1 1 1 1 0 0 0 1 1 0 1 0 1 0 1
1110100101 …     1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0
0100000010 …     0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0
• Memory regions (most often the size of a physical page) are represented as bit vectors, where a set bit indicates that the program accessed the corresponding cache block.
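As a rough illustration of this representation, the sketch below maps byte addresses to a region number and a block offset, then sets one bit per accessed block. The 64-byte block size, the 4 kB region size, and all function names are assumptions made for the example, not values taken from the slides.

```python
BLOCK_SIZE = 64      # bytes per cache block (assumed)
REGION_SIZE = 4096   # bytes per region, i.e. one 4 kB page (assumed)
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE


def split_address(addr: int) -> tuple[int, int]:
    """Return (region number, block offset within the region)."""
    return addr // REGION_SIZE, (addr % REGION_SIZE) // BLOCK_SIZE


def record_access(pattern: int, addr: int) -> int:
    """Set the bit for the block that `addr` falls in."""
    _, offset = split_address(addr)
    return pattern | (1 << offset)


def format_pattern(pattern: int) -> str:
    """Render the vector with block 0 on the left."""
    return "".join("1" if (pattern >> i) & 1 else "0" for i in range(BLOCKS_PER_REGION))


pattern = 0
for addr in (0x1000, 0x1040, 0x1140):   # blocks 0, 1 and 5 of region 1
    pattern = record_access(pattern, addr)
print(format_pattern(pattern)[:8])      # first eight blocks of the region
```

A Python integer stands in for the hardware bit vector here; only the region-relative block offset matters, which is what lets one learned pattern be replayed in a different region.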
[Diagram: Core & L1, L2 Cache, and Main Memory; SMS holds the Active Generation Table (AGT) and the Pattern History Table (PHT).]
Step 1: Observe Misses
PC1: ld A+2
AGT:
PC + Offset   Cache Blocks
PC1+2         0 0 1 0 0 0 0 0
PC1: ld A+5
AGT:
PC + Offset   Cache Blocks
PC1+2         0 0 1 0 0 1 0 0
PC2: ld B+4
AGT:
PC + Offset   Cache Blocks
PC1+2         0 0 1 0 0 1 0 0
PC2+4         0 0 0 0 1 0 0 0
PC1: ld A
AGT:
PC + Offset   Cache Blocks
PC1+2         1 0 1 0 0 1 0 0
PC2+4         0 0 0 0 1 0 0 0
L2: evict A
Step 2: Eviction Transfers Entry to PHT
AGT:
PC + Offset   Cache Blocks
PC2+4         0 0 0 0 1 0 0 0
PHT:
PC + Offset   Cache Blocks
PC1+2         1 0 1 0 0 1 0 0
PC1: ld C+2
Step 3: Issue Prefetches {C, C+5}
AGT:
PC + Offset   Cache Blocks
PC2+4         0 0 0 0 1 0 0 0
PHT:
PC + Offset   Cache Blocks
PC1+2         1 0 1 0 0 1 0 0
1. Observe Misses
2. Eviction Transfers Entry to PHT
3. Issue Prefetches {p. block}
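The three steps can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the SMS hardware design: regions are 8 blocks wide to match the vectors in the example, both tables are unbounded dictionaries, and the class and method names are invented for illustration.

```python
REGION_BLOCKS = 8  # blocks per region, matching the 8-bit vectors in the example


class SMS:
    def __init__(self):
        self.agt = {}  # region -> ((pc, trigger offset), bit list)
        self.pht = {}  # (pc, trigger offset) -> bit list

    def access(self, pc, region, offset):
        """Observe an access; return block offsets to prefetch (may be empty)."""
        if region in self.agt:
            self.agt[region][1][offset] = 1      # step 1: keep recording
            return []
        key = (pc, offset)                       # first access starts a generation
        bits = [0] * REGION_BLOCKS
        bits[offset] = 1
        self.agt[region] = (key, bits)
        learned = self.pht.get(key)
        if learned is None:
            return []
        # step 3: prefetch every learned block except the one just demanded
        return [i for i, b in enumerate(learned) if b and i != offset]

    def evict(self, region):
        """Step 2: an eviction ends the generation and moves it to the PHT."""
        key, bits = self.agt.pop(region)
        self.pht[key] = bits


sms = SMS()
sms.access("PC1", "A", 2)   # PC1: ld A+2
sms.access("PC1", "A", 5)   # PC1: ld A+5
sms.access("PC2", "B", 4)   # PC2: ld B+4
sms.access("PC1", "A", 0)   # PC1: ld A
sms.evict("A")              # L2: evict A
prefetches = sms.access("PC1", "C", 2)   # PC1: ld C+2
print(prefetches)           # block offsets within region C
```

Replaying the example's accesses yields offsets 0 and 5 in region C, which corresponds to the prefetch set {C, C+5} shown above.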
Adding Temporal Information
• Problem: In the bit vectors stored by SMS, we do not have any information about the order of the accesses.
• Goal: A mechanism to distinguish various accesses within a spatial region, preferably one that allows us to prefetch blocks in a more timely fashion.
• Observe that many of our applications see the same PC generating memory requests with different time deltas between each request.
[Figure: timeline (t) of one PC's accesses, H H M H H M H (H = hit, M = miss), annotated with the values 6, 31, 82 and 3, 25, 69.]
• Add a global counter to the L2 that ticks on every cache miss.
• Define access tempo: Delta between two cache accesses.
[Diagram: on each observed cache miss, the L2 Cache Counter increments, e.g. from 000100 to 000101.]
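A minimal sketch of the counter and the access tempo follows. It assumes that the tempo is tracked per PC and measured in counter ticks, which is one reading of the definition above; the class and its fields are hypothetical.

```python
class TempoCounter:
    """Global counter that advances once per L2 cache miss."""

    def __init__(self):
        self.ticks = 0
        self.last_seen = {}  # pc -> counter value at that PC's previous access

    def access(self, pc, is_miss):
        """Record an access; return the access tempo, or None on the first access."""
        if is_miss:
            self.ticks += 1
        previous = self.last_seen.get(pc)
        self.last_seen[pc] = self.ticks
        return None if previous is None else self.ticks - previous


counter = TempoCounter()
counter.access("PC1", is_miss=True)           # counter = 1, first access by PC1
counter.access("PC2", is_miss=True)           # counter = 2
counter.access("PC2", is_miss=True)           # counter = 3
tempo = counter.access("PC1", is_miss=False)  # hit: counter stays at 3
print(tempo)                                  # ticks since PC1's previous access
```

Because the counter ticks on misses rather than cycles, the tempo reflects how much miss traffic separates two accesses by the same PC.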
• Decompose all PHT entries into multiple access tempos.
• Store each disjoint tempo in its own “slice” of SMS.
PHT: 0001111001011110100101101
T1:  0001100001000000000100101
T2:  000011000011110100001000
…
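The decomposition can be illustrated with a small hypothetical pattern. The sketch assumes each accessed block has already been assigned a tempo class (the `tempo_class` mapping is invented for the example) and checks that the slices are disjoint and that their union gives back the original pattern.

```python
def split_into_slices(pattern: str, tempo_class: dict[int, int], num_slices: int) -> list[str]:
    """Place each set bit of `pattern` into the slice named by its tempo class."""
    slices = [["0"] * len(pattern) for _ in range(num_slices)]
    for i, bit in enumerate(pattern):
        if bit == "1":
            slices[tempo_class[i]][i] = "1"
    return ["".join(s) for s in slices]


def merge(slices: list[str]) -> str:
    """OR the slices back together into one bit vector."""
    merged = 0
    for s in slices:
        merged |= int(s, 2)
    return format(merged, f"0{len(slices[0])}b")


pattern = "01101001"                       # hypothetical 8-block pattern
tempo_class = {1: 0, 2: 1, 4: 0, 7: 1}     # block index -> tempo class (assumed)
t0, t1 = split_into_slices(pattern, tempo_class, num_slices=2)
print(t0, t1, merge([t0, t1]))
```

Since every set bit lands in exactly one slice, no two slices share a bit, and prefetching from a subset of slices issues a subset of the original pattern.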
Training Based on Tempo
• Store the access tempos in on-chip Localized Tempo Buffers (LTBs).
• During training, only set the bits of the vector corresponding to the currently measured tempo.
[Diagram: LTB indexed by PC Hash, with rows 4 2 5, 12 8 9, and 1 3 1, plus an Age field; AGT Slices T0: 00000010010, T2: 0100110010, …]
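The training rule might look like the sketch below. The slides do not say how a measured tempo is mapped to a slice, so the thresholds in `SLICE_BOUNDS`, the LTB depth of three entries, and every name here are assumptions chosen only to make the example concrete.

```python
from bisect import bisect_right
from collections import defaultdict, deque

REGION_BLOCKS = 8            # blocks per region (assumed)
SLICE_BOUNDS = [4, 8, 16]    # hypothetical tempo thresholds, giving 4 slices
LTB_DEPTH = 3                # hypothetical number of tempos kept per PC hash

ltb = defaultdict(lambda: deque(maxlen=LTB_DEPTH))   # PC hash -> recent access tempos


def slice_index(tempo: int) -> int:
    """Map a tempo (in counter ticks) to a slice; slice 0 is the fastest."""
    return bisect_right(SLICE_BOUNDS, tempo)


def train(slices: list[list[int]], pc_hash: int, offset: int, tempo: int) -> None:
    """Log the tempo in the LTB and set the block's bit in the matching slice only."""
    ltb[pc_hash].append(tempo)
    slices[slice_index(tempo)][offset] = 1


agt_slices = [[0] * REGION_BLOCKS for _ in range(len(SLICE_BOUNDS) + 1)]
train(agt_slices, pc_hash=7, offset=2, tempo=2)    # fast access: slice 0
train(agt_slices, pc_hash=7, offset=5, tempo=6)    # slice 1
train(agt_slices, pc_hash=7, offset=0, tempo=20)   # slowest access: slice 3
```

Each access touches exactly one slice, so the slices stay disjoint by construction.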
Prefetching Based on Tempo
• Simply knowing our current access tempo is not enough to ensure prefetch timeliness.
• Define memory tempo: Delta between when a memory request is issued and when the block is filled into the cache.
[Diagram: L2 Cache Counter: 000100; MSHR 1 … MSHR N; Creq; Step 1: Store Counter Snapshot; Main Memory.]
[Diagram: L2 Cache Counter: 0010001; MSHR 1 … MSHR N; Creq; Step 2: Calculate Memory Tempo; Main Memory.]
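The two diagram steps amount to a subtraction of counter snapshots. The sketch below assumes the snapshot is kept alongside the MSHR entry for the outstanding block; the `MSHRFile` class is hypothetical, and the counter values are the two shown in the diagrams.

```python
class MSHRFile:
    """Outstanding misses, each tagged with the counter value at issue time."""

    def __init__(self):
        self.snapshots = {}  # block address -> counter snapshot

    def issue(self, block, counter):
        """Step 1: store a counter snapshot when the request is sent to memory."""
        self.snapshots[block] = counter

    def fill(self, block, counter):
        """Step 2: on fill, the memory tempo is the counter delta since issue."""
        return counter - self.snapshots.pop(block)


mshrs = MSHRFile()
mshrs.issue(block=0xB2, counter=int("000100", 2))               # counter = 4
memory_tempo = mshrs.fill(block=0xB2, counter=int("0010001", 2))  # counter = 17
print(memory_tempo)
```

Measuring memory latency in the same miss-tick units as the access tempo is what makes the two directly comparable.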
Rushing or Dragging?
• Can compare the memory’s tempo against the current PC’s access tempo at prefetch time.
PC > Memory (Rushing)
• Generating requests faster than memory is returning them.
• Prefetch only slower tempos.
Memory > PC (Dragging)
• Memory is fast enough to keep up with our requests.
• Prefetch faster tempos.
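The comparison can be written as a two-way decision. The sketch assumes both tempos are deltas in counter ticks, so a smaller number means a faster tempo; treating a tie as dragging is an arbitrary choice made here, and both function names are hypothetical.

```python
def decode(access_tempo: int, memory_tempo: int) -> str:
    """Classify the current PC against memory; a smaller delta is a faster tempo."""
    if access_tempo < memory_tempo:
        return "rushing"    # requests arrive faster than memory returns them
    return "dragging"       # memory keeps up (ties are treated as dragging here)


def tempos_to_prefetch(state: str) -> str:
    """Return which tempos the policy favours in the given state."""
    return "slower" if state == "rushing" else "faster"


state = decode(access_tempo=6, memory_tempo=12)
print(state, tempos_to_prefetch(state))
```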
Walkthrough Example
[Diagram labels: PC: Load [B+2]; LTB (with Age); GTB (with Age); Tempo Decoder; AGT Slices T2: 00000010010, T3: 0100110010, …; output: Prefetch “Slow & V. Slow”. Values shown: 6; 8 4 6; 10; 12 9 14 14 9 10 12. Circled numbers 1 to 5 mark the steps.]
1. PC Observes Demand Request to Address [B+2]
2. Lookup Current Local Access Tempo (6)
3. Average All Elements of Global Tempo Buffer (12)
4. Decode “Rushing” or “Dragging”.
5. Prefetch Matching Tempos
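Steps 2 to 5 can be strung together as in the sketch below. Only the local tempo of 6, the averaged memory tempo of 12, and the "Slow & V. Slow" outcome come from the walkthrough; the GTB contents, the four slice labels, the slice bit patterns, and the rule of keeping the slower or faster half of the slices are all assumptions made for illustration.

```python
from collections import deque

SLICE_NAMES = ["fast", "medium", "slow", "very slow"]  # hypothetical labels, fastest first


def prefetch_vector(ltb, gtb, slices):
    """Return (names of chosen slices, OR of their bit vectors)."""
    access_tempo = ltb[-1]                       # step 2: most recent local tempo
    memory_tempo = round(sum(gtb) / len(gtb))    # step 3: average of the GTB
    rushing = access_tempo < memory_tempo        # step 4: smaller delta = faster
    # step 5 (assumed policy): rushing keeps the slower half, dragging the faster half
    half = len(slices) // 2
    chosen = range(half, len(slices)) if rushing else range(half)
    merged = 0
    for i in chosen:
        merged |= slices[i]
    return [SLICE_NAMES[i] for i in chosen], merged


ltb = deque([8, 4, 6], maxlen=3)                 # current local tempo is 6
gtb = deque([12, 10, 14, 12])                    # hypothetical values averaging 12
slices = [0b00000001, 0b00000110, 0b00010000, 0b10000000]   # hypothetical AGT slices
names, vector = prefetch_vector(ltb, gtb, slices)
print(names, format(vector, "08b"))
```

With an access tempo of 6 against a memory tempo of 12 the PC is rushing, so only the two slower slices contribute to the prefetch vector.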
Evaluation
Core (OoO)
• 6-wide, 256-entry instruction window
• No fetch hazards
• Issue 2 loads & 1 store / cycle
Memory
• 16kB L1, 128kB L2, fully inclusive L3
• 32-entry L2 request queue
• Open Row FR-FCFS memory scheduling
• tl2 = 10 cycles
Config 1: 1MB L3, 12.8 GB/s memory
Config 2: 256kB L3, 12.8 GB/s memory
Config 3: 1 MB L3, 3.2 GB/s memory
Absolute IPC Results
[Figure: absolute IPC under the Normal Configuration for Baseline, SMS, and Tempo. Inset: IPC for mcf, y-axis 0 to 0.18, annotated 20%.]
Increased Low B/W Performance
• Decreasing memory B/W increases Tempo’s gains.
Reduction in Useless Prefetches
[Figure: reduction in useless prefetches.]
Conclusion
1. Presented a “sliced” version of SMS that filters prefetching decisions based on repeated PC access times.
2. Achieved 1.45% and 2.57% IPC improvements on the Normal and Low Bandwidth configurations, respectively.
3. Reduced the number of useless prefetches by 17.6%.