CS698Y: Modern Memory Systems Lecture-11 (Hardware Prefetching)
Biswabandan Panda
https://www.cse.iitk.ac.in/users/biswap/CS698Y.html
Modern Memory Systems Biswabandan Panda, CSE@IITK 2
Flow of the Module
Data Prefetching Techniques
Interaction with Cache Replacement
Metrics Related to Prefetching
Instruction Prefetching
But why prefetching? Remember the memory wall: it is still hurting.
Hardware Prefetching
What? A latency-hiding technique: fetches data before the core demands it.
Why? Off-chip DRAM latency has grown to 400-800 cycles.
How? By observing and predicting the demand access (LOAD/STORE) patterns.
Hardware Prefetch Engine
[Figure: the core, the L2, and the prefetcher, with blocks X, X+1, X+2, and X+3 moving between them in steps ❶ to ❺; step ❺ is a HIT.]
Prefetchers in Multicore - 101
[Figure: Core 0 to Core 3, each with its own L2 and L2 prefetcher (PF), connected through the interconnect to the L3.]
Prefetching Knobs
Prefetch Degree: Number of prefetch requests to issue at a given time.
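[Figure: the prefetcher sits between the L2 and L3/DRAM. A demand access to X triggers prefetches of X+1; of X+1 and X+2; or of X+1 through X+4, depending on the degree.]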
Prefetching Knobs
Prefetch Distance: How far ahead of the demand access stream are the prefetch requests issued?
[Figure: the prefetch to Y runs ahead of the demand access to X by the prefetch distance, for example Y = X + 4, Y = X + 8, or Y = X + 16.]
Aggressiveness [degree, distance]
Prefetch degree: number of prefetch requests issued on a miss.
[Figure: the prefetcher (PF) issues X+1; X+1 and X+2; or X+1 through X+4.]
Prefetch distance: how far ahead of the demand miss, in number of blocks, the prefetch is issued.
[Figure: demand miss at X, prefetch at Y.]
The Simplest Prefetcher
Next Line: on a miss to cache block X, prefetch X+1. Degree = 1, Distance = 1.
Works well for L1 Icache and L1 Dcache.
Next N Line: on a miss to cache block X, prefetch X+1, X+2, ..., X+N. Degree = N; distance ranges from 1 (min.) to N (max.).
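As a rough illustration of the degree and distance knobs, the sketch below generates prefetch candidates for a miss to block X. The helper name `issue_prefetches` and the convention that the first candidate is X + distance are assumptions made for this example; the slides do not fix either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: on a miss to `miss_block`, write `degree` prefetch
 * candidates, the first one `distance` blocks ahead of the miss.
 * `out` must have room for `degree` entries. Returns the count written. */
size_t issue_prefetches(uint64_t miss_block, unsigned distance,
                        unsigned degree, uint64_t *out)
{
    assert(out != NULL && distance >= 1);
    for (unsigned i = 0; i < degree; i++)
        out[i] = miss_block + distance + i;
    return degree;
}
```

Under these assumptions, distance = 1 with degree = 1 behaves like the next-line prefetcher, and distance = 1 with degree = N behaves like next-N-line, covering X+1 through X+N.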
What about this?
Stride Prefetching
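[Figure: a table indexed by the PC of the load. Each entry holds an instruction tag, previous address, stride, and state. The stride is the effective address minus the previous address (-); the prefetch address is the effective address plus the stride (+).]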
An Example
    float a[100][100], b[100][100], c[100][100];
    ...
    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            for (k = 0; k < 100; k++)
                a[i][j] += b[i][k] * c[k][j];

After the first iteration of the k loop:
    instruction tag   previous address   stride   state
    ld b[i][k]        50000              0        initial
    ld c[k][j]        90000              0        initial
    ld a[i][j]        10000              0        initial

After the second iteration:
    instruction tag   previous address   stride   state
    ld b[i][k]        50004              4        trans
    ld c[k][j]        90400              400      trans
    ld a[i][j]        10000              0        steady

After the third iteration:
    instruction tag   previous address   stride   state
    ld b[i][k]        50008              4        steady
    ld c[k][j]        90800              400      steady
    ld a[i][j]        10000              0        steady
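The table updates in the example can be sketched as a small PC-indexed structure. This is a simplified model, not the exact hardware: the three states mirror the initial/trans/steady labels in the example, while the table size, the direct-mapped indexing, the fallback transitions on a mismatch, and the rule of prefetching only in the steady state with a non-zero stride are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RPT_ENTRIES 64  /* toy table size (assumed) */

typedef enum { ST_INVALID = 0, ST_INITIAL, ST_TRANSIENT, ST_STEADY } rpt_state;

typedef struct {
    uint64_t  tag;        /* PC of the load (instruction tag) */
    uint64_t  prev_addr;  /* previous effective address */
    int64_t   stride;     /* last observed stride */
    rpt_state state;
} rpt_entry;

typedef struct { rpt_entry e[RPT_ENTRIES]; } rpt;

void rpt_init(rpt *t) { memset(t, 0, sizeof *t); }

/* Update the entry for `pc` with effective address `addr`.
 * Returns 1 and sets *pf_addr when a prefetch is issued, else 0. */
int rpt_access(rpt *t, uint64_t pc, uint64_t addr, uint64_t *pf_addr)
{
    assert(t != NULL && pf_addr != NULL);
    rpt_entry *e = &t->e[pc % RPT_ENTRIES];

    if (e->state == ST_INVALID || e->tag != pc) {
        /* First sighting of this load: allocate in the initial state. */
        e->tag = pc;
        e->prev_addr = addr;
        e->stride = 0;
        e->state = ST_INITIAL;
        return 0;
    }

    int64_t observed = (int64_t)(addr - e->prev_addr);
    if (observed == e->stride) {
        e->state = ST_STEADY;               /* stride confirmed */
    } else {
        /* Mismatch: learn the new stride and lose confidence (assumed rule). */
        e->state = (e->state == ST_STEADY) ? ST_INITIAL : ST_TRANSIENT;
        e->stride = observed;
    }
    e->prev_addr = addr;

    if (e->state == ST_STEADY && e->stride != 0) {
        *pf_addr = addr + (uint64_t)e->stride;
        return 1;
    }
    return 0;
}
```

Feeding the `ld b[i][k]` addresses 50000, 50004, 50008 through `rpt_access` walks the entry through initial, trans, and steady, and the third access yields a prefetch for 50012. The `ld a[i][j]` entry reaches steady with stride 0, so this sketch issues nothing for it.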
Pointer Chasers
Stream Prefetching [DPC1]
Example 1: 1st, 2nd, and 3rd misses to 100, 102, 104. Trained!
Example 2: miss sequence 503, 504, 501, 499.
Misses 503, 504, 501 change direction: Fail!
Misses 504, 501, 499 all move in the same direction: Trained!
Training: Consecutive misses in the same direction.
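The training rule can be sketched as a direction check over three consecutive misses. The function name `stream_train` and the choice to compare consecutive differences are assumptions for this illustration; a real stream prefetcher also tracks a per-stream entry and a training window, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: inspect three consecutive miss addresses
 * (in cache blocks). Returns +1 for an ascending stream, -1 for a
 * descending stream, and 0 if the direction is not consistent. */
int stream_train(const int64_t miss[3])
{
    assert(miss != NULL);
    int64_t d1 = miss[1] - miss[0];
    int64_t d2 = miss[2] - miss[1];
    if (d1 > 0 && d2 > 0) return +1;
    if (d1 < 0 && d2 < 0) return -1;
    return 0;  /* direction changed or repeated address: not trained */
}
```

On the two examples from the slide, 100, 102, 104 trains as ascending, 503, 504, 501 fails, and 504, 501, 499 trains as descending.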
Stream Prefetcher in Action
[Figure: a trained stream. Labels: original addr, stream direction, memory access, monitored region (from start addr to end addr), prefetch distance, prefetch degree.]
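One common way to read the figure is that a demand access inside the monitored region triggers `degree` prefetches just beyond the end address, after which the region slides forward by the same amount. The sketch below models that reading for an ascending stream only; the struct layout, the function name `stream_access`, and the sliding rule are assumptions, not details taken from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int64_t  start;   /* first block of the monitored region */
    int64_t  end;     /* last block; end - start models the prefetch distance */
    unsigned degree;  /* prefetches issued per trigger */
} stream_entry;

/* Hypothetical helper for an ascending stream. If `block` falls inside the
 * monitored region, write `degree` prefetch candidates past `end` into `out`
 * and advance the region. Returns the number of prefetches issued. */
unsigned stream_access(stream_entry *s, int64_t block, int64_t *out)
{
    assert(s != NULL && out != NULL && s->start <= s->end);
    if (block < s->start || block > s->end)
        return 0;                       /* access outside the region */
    for (unsigned i = 0; i < s->degree; i++)
        out[i] = s->end + 1 + (int64_t)i;
    s->start += s->degree;              /* slide the region forward */
    s->end   += s->degree;
    return s->degree;
}
```

For a region covering blocks 100 to 104 with degree 2, an access to block 101 would issue prefetches for 105 and 106 and move the region to 102 through 106.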
Stream + Stride
[Figure: a trained stream with a stride. Labels: original addr, stream direction, memory access, monitored region (from start addr to end addr), prefetch distance * stride, prefetch degree * stride.]
Quantifying Prefetchers
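Prefetch accuracy (a prefetch hit: prefetched block in the cache):
\[ \text{PrefetchAccuracy}(i) = \frac{\text{PrefetchHits}(i)}{\text{PrefetchIssued}(i)} \]
Prefetch lateness (a late prefetch: prefetched block still on its way):
\[ \text{PrefetchLateness}(i) = \frac{\text{PrefetchLate}(i)}{\text{PrefetchHits}(i)} \]
LLC pollution (prefetched block evicted a demand block that will be reused):
\[ \text{LLCPollution}(i) = \frac{\text{Poll}(i)}{\text{DemandMisses}(i)} \]
Coverage (fraction of misses avoided):
\[ \text{Coverage}(i) = \frac{\text{PrefetchHits}(i)}{\text{PrefetchHits}(i) + \text{DemandMisses}(i)} \]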
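The four ratios can be computed from per-core event counters, as in the sketch below. The struct and function names are hypothetical, and returning 0 for an empty denominator is a convenience choice for this example rather than part of the definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-core counters gathered over a sampling interval. */
typedef struct {
    unsigned long prefetch_issued;  /* prefetch requests sent */
    unsigned long prefetch_hits;    /* demand accesses that hit a prefetched block */
    unsigned long prefetch_late;    /* prefetched block still on its way when demanded */
    unsigned long demand_misses;    /* demand misses that remain */
    unsigned long poll;             /* misses caused by prefetch-induced evictions */
} pf_counters;

/* Safe division: an empty denominator yields 0 (assumed convention). */
static double ratio(unsigned long num, unsigned long den)
{
    return den ? (double)num / (double)den : 0.0;
}

double pf_accuracy(const pf_counters *c)
{
    assert(c != NULL);
    return ratio(c->prefetch_hits, c->prefetch_issued);
}

double pf_lateness(const pf_counters *c)
{
    assert(c != NULL);
    return ratio(c->prefetch_late, c->prefetch_hits);
}

double llc_pollution(const pf_counters *c)
{
    assert(c != NULL);
    return ratio(c->poll, c->demand_misses);
}

double pf_coverage(const pf_counters *c)
{
    assert(c != NULL);
    return ratio(c->prefetch_hits, c->prefetch_hits + c->demand_misses);
}
```

With 100 prefetches issued, 80 prefetch hits, 20 late prefetches, 120 demand misses, and 6 pollution misses, these functions give an accuracy of 0.8, lateness of 0.25, pollution of 0.05, and coverage of 0.4.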
Prefetch Lateness
[Figure: the prefetch for block X has been issued but X is not yet in the cache, so the demand access to X is a cache miss.]
Cache Pollution
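[Figure: a cache with Set 1 and Set 2 holding blocks such as X, Y, A, B, C, and Z. A prefetch displaces block X, so the later demand access to X is a cache miss.]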
Cache Hits & Accuracy
[Figure: a cache with Set 1 and Set 2 holding blocks Z, Y, A, and B. A prefetch brings in block Z, so the later demand access to Z is a cache hit.]
Where to Put These Prefetchers?
L1? Next-line, PC-localized stride predictors.
L2? Stream + stride, other variants.
L1 instruction cache? Predict the future PC.
State-of-the-art Prefetchers
Perfect Timing
Delayed Prefetching
Offset
Offset = Sum of strides
milc: Offset
GemsFDTD: Offset
Best-offset Prefetcher [HPCA ‘16]
Specialized Streams
Temporal streams: sequences of temporally correlated addresses, exploited by TMS [ISCA ‘05].
Spatial streams: streams that are correlated in space, exploited by SMS [ISCA ‘06].
Spatio-temporal streams: temporal correlation among the spatial regions, and spatial correlation within a region, exploited by STeMS [ISCA ‘09].
Spatial Memory Streaming (SMS)
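[Figure: the SMS tables. The Active Generation Table (AGT) comprises a Filter Table (FT; fields: Tag, PC/Offset) and an Accumulation Table (AT; fields: Tag, PC/Offset, Bit Vector). The Pattern History Table (PHT) maps a signature (Sig) to a bit vector.
Example for region A: a miss to A+1 allocates the FT entry (A, PC/1). (1) A miss to A+3 creates the AT entry (A, PC/1, 0101). (2) A miss to A+2 updates the bit vector to 0111. (3) On eviction/invalidation of A, the PHT records (PC/1, 0111).]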
- Divides the memory space into fixed-size regions, indexed by a signature (PC/offset).
- Each signature contains a bit vector.
- Each bit in the bit vector corresponds to a cache line.
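The record-then-replay idea can be sketched with one bit vector per region and a signature-indexed history table. This is a toy model under several assumptions: a 4-block region, a 256-entry direct-mapped table, a simple arithmetic signature, a single merged generation record in place of separate filter and accumulation tables, and all function names. Bit k of the pattern stands for block base+k, so offsets 1, 2, and 3 give the value 0xE.

```c
#include <assert.h>
#include <stdint.h>

#define REGION_BLOCKS 4u    /* toy region size in cache blocks (assumed) */
#define PHT_ENTRIES   256u  /* toy pattern history table size (assumed) */

typedef struct {
    uint64_t base;     /* first block of the region */
    uint32_t sig;      /* signature built from trigger PC and trigger offset */
    uint32_t pattern;  /* bit k set => block base+k was accessed */
} generation;

static uint32_t pht[PHT_ENTRIES];  /* signature -> last recorded pattern */

static uint32_t make_sig(uint64_t pc, uint32_t offset)
{
    return (uint32_t)((pc * REGION_BLOCKS + offset) % PHT_ENTRIES);
}

/* The trigger access opens a generation for the region containing `block`. */
generation gen_begin(uint64_t pc, uint64_t block)
{
    generation g;
    uint32_t offset = (uint32_t)(block % REGION_BLOCKS);
    g.base = block - offset;
    g.sig = make_sig(pc, offset);
    g.pattern = 1u << offset;
    return g;
}

/* Later accesses to the same region set their bit in the vector. */
void gen_record(generation *g, uint64_t block)
{
    assert(block >= g->base && block < g->base + REGION_BLOCKS);
    g->pattern |= 1u << (uint32_t)(block - g->base);
}

/* Eviction or invalidation closes the generation and stores the pattern. */
void gen_end(const generation *g) { pht[g->sig] = g->pattern; }

/* On a new trigger access, replay the stored pattern in the new region.
 * Writes prefetch candidates to `out` and returns how many were written. */
int sms_predict(uint64_t pc, uint64_t block, uint64_t out[REGION_BLOCKS])
{
    uint32_t offset = (uint32_t)(block % REGION_BLOCKS);
    uint64_t base = block - offset;
    uint32_t pattern = pht[make_sig(pc, offset)];
    int n = 0;
    for (uint32_t k = 0; k < REGION_BLOCKS; k++)
        if (k != offset && (pattern & (1u << k)))
            out[n++] = base + k;   /* skip the trigger block itself */
    return n;
}
```

Replaying the slide's example with region A starting at block 8: a trigger at A+1 followed by accesses to A+3 and A+2 stores the pattern, and a later trigger by the same PC at offset 1 of another region yields prefetches for offsets 2 and 3 of that region.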
Reading Assignment-1
Proactive Instruction Fetch [MICRO ‘11]
Indirect Memory Prefetcher [MICRO ‘15]
Deadline: October 7, 2017, 17:00 hrs through Canvas
More details through Piazza
Programming Assignment-2
Will be released on Sept 11, 2017
Based on Hardware Prefetchers
This time: You have to code (no analysis)
PA1 Presentation
12th Sept, 15:00 hrs IST, KD-103
7+1 min presentation
Do not put MPKI and IPC numbers
You will evaluate your peers
What About Irregular Applications? [MICRO ‘13]
PC Localization
Structural Address Space
Irregular Stream Buffer [MICRO ‘13]
Interaction with Cache Replacement
Read PACMan [MICRO ‘11]
Crux: prefetched blocks are typically not reused after their first use, so insert them with the lowest priority.
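To make the crux concrete, the sketch below applies prefetch-aware insertion to one set of an RRIP-style cache. It is an illustration of the idea only, not the PACMan design: the 4-way set, the 2-bit re-reference value, the demand insertion position, and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS     4  /* toy associativity (assumed) */
#define RRPV_MAX 3  /* 2-bit re-reference prediction value; MAX = evict first */

typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  rrpv;
} cache_line;

/* Pick an invalid way, or the first way predicted to be re-referenced in
 * the distant future; age every line until such a way exists. */
static int find_victim(cache_line set[WAYS])
{
    for (;;) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid || set[w].rrpv == RRPV_MAX)
                return w;
        for (int w = 0; w < WAYS; w++)
            set[w].rrpv++;  /* no line is at RRPV_MAX here, so no overflow */
    }
}

/* Insert `tag` into the set. Prefetched blocks get the lowest priority
 * (RRPV_MAX); demand blocks get one step more protection. Returns the way. */
int insert_block(cache_line set[WAYS], uint64_t tag, int is_prefetch)
{
    assert(set != NULL);
    int w = find_victim(set);
    set[w].valid = 1;
    set[w].tag = tag;
    set[w].rrpv = (uint8_t)(is_prefetch ? RRPV_MAX : RRPV_MAX - 1);
    return w;
}
```

In this sketch, a set holding three demand blocks and one prefetched block gives up the prefetched block first when a new demand block arrives, which keeps the demand blocks resident.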