Profile-Guided Post-Link Stride Prefetching

C-K Luk, Robert Muth, Harish Patil, Richard Weiss†,
P. Geoffrey Lowney, Robert Cohn

Massachusetts Microprocessor Design Center, Intel Corporation
†Department of Computer Science, Smith College

ICS'02

This work was done while the authors were with Compaq.
Software-Based Data Prefetching

A promising method to hide memory latency:
– Most modern processors support data-prefetch instructions
– Rely on compilers to automatically insert prefetches

The compiler-based approach works quite well, but has two limitations:
– It targets memory references with statically known strides
  • Few of these exist in codes with pointers, sparse matrices, malloc()
– It needs source code for recompilation
  • Source code may not be available (legacy codes, libraries, commercial codes)
  • Recompilation may not be desired
Our Approach

Profile-Guided Post-Link Stride Prefetching:
– Profiles address strides that are unknown to compilers
– Inserts stride prefetches into executables directly

Substantial performance gains on Alpha:
– 3%-56% speedups in 11 SPEC2000 benchmarks

Incorporated into the Compaq Unix tools Pixie and Spike.
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Address Strides in SPEC2K MCF

    arc* arcin;
    node* tail;
    ...
    while (arcin) {
        tail = arcin->tail;
        ...
        arcin = tail->mark;
    }

Profiled strides: arcin advances by a constant -144B per iteration, tail by -96B.

These loads account for 26% of the total stall time. The list structures are
sequentially allocated and remain unchanged afterwards, so the strides persist.
SPEC2K EQUAKE

    for (i = 0; i < nodes; i++) {
        Anext = Aindex[i];
        Alast = Aindex[i + 1];
        ...
        Anext++;
        while (Anext < Alast) {
            ...
            sum0 += A[Anext][0][0] * v[col][0] +
                    A[Anext][0][1] * v[col][1] +
                    A[Anext][0][2] * v[col][2];
            sum1 += A[Anext][1][0] * v[col][0] +
                    A[Anext][1][1] * v[col][1] +
                    A[Anext][1][2] * v[col][2];
            sum2 += A[Anext][2][0] * v[col][0] +
                    A[Anext][2][1] * v[col][1] +
                    A[Anext][2][2] * v[col][2];
            ...
            Anext++;
        }
        ...
    }

These loads account for 70% of the total stall time.
Address Strides in EQUAKE

Compiler's view: A[Anext] and A[Anext+1] are separate pointer-based rows, so
the distance from A[Anext][0][0] to A[Anext+1][0][0] is statically unknown.

Memory layout: the rows were allocated sequentially, so A[Anext][0][0] and
A[Anext+1][0][0] are in fact a constant stride of 128B apart — a stride the
profiler observes but the compiler cannot prove.
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Algorithm: 3 Major Steps

1. Instrumentation: Pixie instruments the executable foo, producing foo.pixie
2. Stride Profiling: foo.pixie runs on a training input, producing the stride
   profile foo.sprof
3. Prefetch Insertion: Spike reads foo and foo.sprof and emits the prefetching
   executable foo.pf
Step 1: Instrumentation

Two issues:

Which loads do not need to be instrumented?
– Scalar loads (i.e., base registers are either $gp or $sp)
– Compiler-prefetched loads (i.e., loads with static strides)

Where should instrumentation be placed?
– At the points where base registers are defined
Instrumentation (Example)

Naive instrumentation:

    R1 <- R3 + R4;
    R2 <- R5 + R6;
    // R1 and R2 aren't redefined here
    StrideProfile(R1 + 24);
    R3 <- load 24(R1);
    StrideProfile(R2 + 64);
    R4 <- load 64(R2);
    StrideProfile(R1 + 48);
    R5 <- load 48(R1);
    StrideProfile(R2 + 96);
    R6 <- load 96(R2);

Optimized instrumentation:

    R1 <- R3 + R4;
    R2 <- R5 + R6;
    StrideProfile(R1, R2);
    // R1 and R2 aren't redefined here
    R3 <- load 24(R1);
    R4 <- load 64(R2);
    R5 <- load 48(R1);
    R6 <- load 96(R2);

Reduces profiling overhead (both code size and runtime).
Step 2: Stride Profiling

A stride is recognized if it occurs twice in a row:
    64B, 100B  ->  no stride found
    64B, 64B   ->  a stride of 64B
– Up to 10 strides are recorded per load
– The average run-length of each stride is also recorded
  • # of consecutive instances that share the same stride

Complete vs. sampled profiling:
– Complete: profile the entire execution (accurate but slower)
– Sampled: profile part of the execution (faster but less accurate)
  • First N instances
  • Periodic
Step 3: Prefetch Insertion

Three tasks:
– Choosing from Multiple Strides
– Computing Prefetching Distances
– Prefetch Minimization
Choosing from Multiple Strides

Distribution of strides for each static load:
– SPECINT: the most frequent stride occurs ~90% of the time
– SPECFP: the most frequent stride occurs ~70% of the time

Two possible approaches:
– Computing strides at run-time
  + adaptive
  - expensive (especially when spilling is introduced)
– Selecting the most frequent stride
  + inexpensive, yet achieves most of the benefits
Computing Prefetching Distances

The prefetching distance D is the number of loop iterations to fetch ahead:

    D = (Latency(cur->next) × IPC) / (Length of the loop body)

    while (cur) {
        cur = cur->next;
        prefetch(cur + stride*D);
        /* other work to do */
    }

Also take the stride's run-length R into account:
– Observation: the first D instances of a run are not prefetched
  => If R ≤ D, we want a smaller prefetching distance (e.g., R/2)
Prefetch Minimization

Goal: remove prefetches of cache lines that have already been prefetched.

Example (before minimization):

    R1 <- R1 + 1024;
    prefetch 16(R1);
    prefetch 64(R1);
    if (...) then
        prefetch 32(R1); ...
    else
        prefetch 1024(R1); ...
    endif
    prefetch 72(R1);
    prefetch 118(R1); ...

After minimization:

    R1 <- R1 + 1024;
    prefetch 16(R1);     // beginning of a 3-line span
    prefetch 80(R1);     // 1 line from the beginning
    if (...) then
        ...              // prefetch 32(R1) combined into the span
    else
        prefetch 1024(R1); ...
    endif
    // prefetch 72(R1) combined into the span
    prefetch 118(R1); ...
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Experimental Framework

Machine: a DS20E Alpha workstation
– 667-MHz 21264 processor
– Memory hierarchy:
  • 64KB D-cache, 8MB L2-cache, 2GB memory
  • Miss latencies: 12+ cycles to L2, 80+ cycles to memory

Benchmarks: the entire SPEC2000 suite
– Profiled on "training" inputs, run on "reference" inputs

Baseline: Compaq C & Fortran compilers at "-O5"
– These implement classic array-based prefetching
Performance of Stride Prefetching

Substantial speedups:
– 3%-56% speedups in 11 benchmarks
– ≥10% speedups in 8 benchmarks

Only mild slowdowns:
– 1%-3% in 4 benchmarks

Normalized execution time (%), baseline = 100:

SPECFP:  ammp 99, applu 82, apsi 100, art 99, equake 64, facerec 89,
         fma3d 85, galgel 99, lucas 90, mesa 91, mgrid 100, sixtrack 98,
         swim 101, wupwise 97, AVG 92

SPECINT: bzip2 98, crafty 100, eon 101, gap 90, gcc 100, gzip 100,
         mcf 89, parser 96, perlbmk 102, twolf 103, vortex 99, vpr 96,
         AVG 98
Where Do the Gains Come From?

Wanted: load misses converted into prefetches
– Significant in gap, mcf, applu, equake, facerec, fma3d, lucas

Don't want too many extra misses
– A problem in twolf and facerec

Percentage of original L1-L2 traffic (O = original = 100, S = with stride
prefetching; the figure further breaks traffic into load misses, store
misses, and prefetches):

SPECINT: bzip2 110, crafty 101, gap 107, mcf 107, parser 102, perlbmk 100,
         twolf 132, vortex 108, vpr 102

SPECFP:  ammp 108, applu 101, apsi 101, art 101, equake 101, facerec 126,
         fma3d 106, galgel 102, lucas 100, mesa 101, mgrid 100, swim 100,
         wupwise 103
Overhead of Stride Profiling

Profiling time expressed as % of the original execution time
(naive instrumentation / optimized instrumentation; e.g., mcf's 545% and
438% correspond to the 5.5x and 4.4x slowdowns called out in the figure):

SPECINT: bzip2 3043/1015, crafty 2347/2551, eon 1663/941, gap 1696/1782,
         gcc 1600/1444, gzip 1447/1772, mcf 545/438, parser 1544/1495,
         perlbmk 1633/1729, twolf 1632/1440, vortex 2019/1770,
         vpr 1577/1568, AVG 1729/1495

SPECFP:  ammp 1207/599, applu 1435/335, apsi 1684/830, art 934/418,
         equake 996/756, facerec 1777/1272, fma3d 2894/960, galgel 3046/460,
         lucas 1248/535, mesa 1126/1075, mgrid 4098/317, sixtrack 1934/623,
         swim 1349/655, wupwise 1697/789, AVG 1816/687

Similar overhead to other software value profilers. Optimized
instrumentation is quite helpful:
– It reduces the overhead by two-thirds for SPECFP
Effectiveness of Sampled Profiling

Profiling time expressed as % of the original execution time
(complete profiling with optimized instrumentation / sampled profiling):

SPECINT: bzip2 1015/724, crafty 2551/1472, eon 941/619, gap 1782/1071,
         gcc 1444/992, gzip 1772/1145, mcf 438/292, parser 1495/927,
         perlbmk 1729/1103, twolf 1440/846, vortex 1770/1171, vpr 1568/1000,
         AVG 1495/947

SPECFP:  ammp 599/384, applu 335/257, apsi 830/550, art 418/328,
         equake 756/510, facerec 1272/889, fma3d 960/654, galgel 460/330,
         lucas 535/394, mesa 1075/875, mgrid 317/225, sixtrack 623/412,
         swim 655/516, wupwise 789/497, AVG 687/487

Sampling speeds up profiling by ~50%, with almost no degradation in the
resulting prefetching performance.
Conclusions

Broadens the scope of software prefetching:
– Uses profiling to detect statically unknown address strides
– Applies prefetching without source code or recompilation

Significant performance gains:
– 3%-56% speedups in 11 SPEC2K benchmarks on Alpha

Profiling overhead can be largely reduced:
– By instrumentation heuristics and sampling

Incorporated into Compaq Unix tools (Pixie and Spike).
Try it out yourself!
http://www.tru64unix.compaq.com/spike
Backups
Distribution of Strides

[Figure: for each SPECINT and SPECFP benchmark, the percentage of strided
loads covered by the most frequent stride vs. less frequent strides.]
Handling Multiple Strides

(a) Original:

    R2 <- load 16(R1)

(b) Choosing the most frequent stride:

    // k = most_frequent_stride * prefetching_distance + 16
    prefetch k(R1);
    R2 <- load 16(R1);

(c) Run-time stride detection:

    t <- R1 - t;          // t has been holding the previous value of R1,
                          // so (R1 - t) is the stride
    t <- prefetching_distance * t;
    t <- R1 + t;
    prefetch 16(t);
    t <- R1;              // save R1 for computing the next stride
    R2 <- load 16(R1);
DS20E's Memory Hierarchy

    Line Size                     64 bytes
    I-Cache                       64 KB, 2-way set-associative
    D-Cache                       64 KB, 2-way set-associative
    Off-Chip, Unified L2-Cache    8 MB, direct-mapped
    L1-to-L2 Miss Latency         12+ cycles
    L1-to-Memory Miss Latency     80+ cycles
    L1-to-L2 Bandwidth            6.9 GB/sec
    L2-to-Memory Bandwidth        2.6 GB/sec
    Memory Parallelism            32 in-flight loads, 32 in-flight stores,
                                  8 in-flight cache block fills, 8 cache victims
Number of Instrumentation Sites

(naive instrumentation / optimized instrumentation)

SPECINT: bzip2 6605/3273, crafty 12230/6115, eon 21227/9188, gap 25211/16413,
         gcc 65122/34963, gzip 6621/3353, mcf 7210/3373, parser 14003/7308,
         perlbmk 25420/16956, twolf 17721/10245, vortex 17408/8919,
         vpr 11146/6141

SPECFP:  ammp 10599/4052, applu 27087/12347, apsi 28537/14470, art 9089/4239,
         equake 7882/3717, facerec 26212/12914, fma3d 80869/32904,
         galgel 43423/16210, lucas 24434/12060, mesa 25150/11594,
         mgrid 23137/11675, sixtrack 60519/26459, swim 23055/11967,
         wupwise 22430/11511

Optimized instrumentation reduces the number of sites by half or more in
most cases.
Instrumentation Heuristics and Prefetch Minimization

Normalized execution time (%) under four configurations: (1) all loads
instrumented, (2) only non-scalar loads, (3) only non-scalar and
non-compiler-prefetched loads, (4) configuration 3 plus prefetch
minimization:

SPECINT: bzip2 98/97/100/98, crafty 100/100/100/100, eon 103/103/104/101,
         gap 90/90/90/90, gcc 102/102/101/100, gzip 100/100/99/100,
         mcf 90/92/90/89, parser 96/96/96/96, perlbmk 102/102/100/102,
         twolf 103/102/102/103, vortex 100/101/100/99, vpr 96/96/97/96,
         AVG 98/98/98/98

SPECFP:  ammp 100/101/99/99, applu 85/85/85/82, apsi 100/100/99/100,
         art 100/100/100/99, equake 64/64/64/64, facerec 91/89/89/89,
         fma3d 85/85/85/85, galgel 106/104/102/99, lucas 91/91/91/90,
         mesa 89/90/89/91, mgrid 102/102/100/100, sixtrack 96/96/96/98,
         swim 102/102/101/101, wupwise 97/97/96/97, AVG 93/93/93/92
Performance of Sampled Profiling

Almost no performance degradation from sampling. Normalized execution
time (%), complete profiling / sampled profiling:

SPECFP:  ammp 99/98, applu 82/83, apsi 100/100, art 99/99, equake 64/64,
         facerec 89/88, fma3d 85/85, galgel 99/100, lucas 90/90, mesa 91/89,
         mgrid 100/100, sixtrack 98/97, swim 101/101, wupwise 97/97,
         AVG 92/92

SPECINT: bzip2 98/98, crafty 100/100, eon 101/101, gap 90/90, gcc 100/99,
         gzip 100/100, mcf 89/92, parser 96/96, perlbmk 102/101,
         twolf 103/104, vortex 99/99, vpr 96/96, AVG 98/98
HW vs. SW Stride Prefetching

Hardware stride prefetching:
– Chen and Baer's scheme:
  • Computes strides at runtime using special hardware
  • Has a table that associates strides with loads

Advantages of HW stride prefetching:
– Requires no recompilation or post-link optimization
– No instruction overhead
– Adaptive

Advantages of SW stride prefetching:
– No limit on how many static loads can be remembered
– More programmable (e.g., in choosing prefetching distances)
– Applicable to most existing machines
Other Software Stride Prefetching

Compiler-inserted stride prefetching:
– Stoutchinin et al.'s approach:
  • Focuses on prefetching recurrent pointer updates
  • Does not use profiling
– Wu et al.'s approach:
  • Does stride profiling at the source level
  • Reduces profiling overhead by code transformation

Post-link stride prefetching:
– Barnes et al.'s approach:
  • Uses cache simulation to guide prefetch insertion