Profile-Guided Post-Link Stride Prefetching

C-K Luk, Robert Muth, Harish Patil, Richard Weiss†,
P. Geoffrey Lowney, Robert Cohn

Massachusetts Microprocessor Design Center, Intel Corporation
†Department of Computer Science, Smith College

ICS'02

This work was done while the authors were with Compaq.
Software-Based Data Prefetching

A promising method to hide memory latency:
– Most modern processors support data-prefetch instructions
– Rely on compilers to automatically insert prefetches

The compiler-based approach works quite well, but has two limitations:
– It targets memory references with statically known strides
  • Few of these exist in codes with pointers, sparse matrices, malloc()
– It needs source code for recompilation
  • Source code may not be available (legacy codes, libraries, commercial codes)
  • Recompilation may not be desired
Our Approach

Profile-Guided Post-Link Stride Prefetching:
– Profiles address strides that are unknown to compilers
– Inserts stride prefetches into executables directly

Substantial performance gains on Alpha:
– 3%-56% speedups in 11 SPEC2000 benchmarks

Incorporated into the Compaq Unix tools Pixie and Spike.
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Address Strides in SPEC2K MCF

    arc* arcin;
    node* tail;
    ...
    while (arcin) {
        tail = arcin->tail;
        ...
        arcin = tail->mark;
    }

Profiled strides: arcin advances by a constant -144B per iteration, tail by -96B.

These loads account for 26% of the total stall time. The list structures are
sequentially allocated and remain unchanged afterwards, so the strides persist.
SPEC2K EQUAKE

    for (i = 0; i < nodes; i++) {
        Anext = Aindex[i];
        Alast = Aindex[i + 1];
        ...
        Anext++;
        while (Anext < Alast) {
            ...
            sum0 += A[Anext][0][0] * v[col][0] +
                    A[Anext][0][1] * v[col][1] +
                    A[Anext][0][2] * v[col][2];
            sum1 += A[Anext][1][0] * v[col][0] +
                    A[Anext][1][1] * v[col][1] +
                    A[Anext][1][2] * v[col][2];
            sum2 += A[Anext][2][0] * v[col][0] +
                    A[Anext][2][1] * v[col][1] +
                    A[Anext][2][2] * v[col][2];
            ...
            Anext++;
        }
        ...
    }

These loads account for 70% of the total stall time.
Address Strides in EQUAKE

Compiler's view: A[Anext] and A[Anext+1] are separate pointer-based rows, so
the distance from A[Anext][0][0] to A[Anext+1][0][0] is statically unknown.

Memory layout: the rows were allocated sequentially, so A[Anext][0][0] and
A[Anext+1][0][0] are in fact a constant stride of 128B apart — a stride the
profiler observes but the compiler cannot prove.
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Algorithm: 3 Major Steps

1. Instrumentation: Pixie instruments the executable foo, producing foo.pixie
2. Stride Profiling: foo.pixie runs on a training input, producing the stride
   profile foo.sprof
3. Prefetch Insertion: Spike reads foo and foo.sprof and emits the prefetching
   executable foo.pf
Step 1: Instrumentation

Two issues:

Which loads do not need to be instrumented?
– Scalar loads (i.e., base registers are either $gp or $sp)
– Compiler-prefetched loads (i.e., loads with static strides)

Where should instrumentation be placed?
– At the points where base registers are defined
Instrumentation (Example)

Naive instrumentation:

    R1 <- R3 + R4;
    R2 <- R5 + R6;
    // R1 and R2 aren't redefined here
    StrideProfile(R1 + 24);
    R3 <- load 24(R1);
    StrideProfile(R2 + 64);
    R4 <- load 64(R2);
    StrideProfile(R1 + 48);
    R5 <- load 48(R1);
    StrideProfile(R2 + 96);
    R6 <- load 96(R2);

Optimized instrumentation:

    R1 <- R3 + R4;
    R2 <- R5 + R6;
    StrideProfile(R1, R2);
    // R1 and R2 aren't redefined here
    R3 <- load 24(R1);
    R4 <- load 64(R2);
    R5 <- load 48(R1);
    R6 <- load 96(R2);

Reduces profiling overhead (both code size and runtime).
Step 2: Stride Profiling

A stride is recognized if it occurs twice in a row:
    64B, 100B  ->  no stride found
    64B, 64B   ->  a stride of 64B
– Up to 10 strides are recorded per load
– The average run-length of each stride is also recorded
  • # of consecutive instances that share the same stride

Complete vs. sampled profiling:
– Complete: profile the entire execution (accurate but slower)
– Sampled: profile part of the execution (faster but less accurate)
  • First N instances
  • Periodic
Step 3: Prefetch Insertion

Three tasks:
– Choosing from Multiple Strides
– Computing Prefetching Distances
– Prefetch Minimization
Choosing from Multiple Strides

Distribution of strides for each static load:
– SPECINT: the most frequent stride occurs ~90% of the time
– SPECFP: the most frequent stride occurs ~70% of the time

Two possible approaches:
– Computing strides at run-time
  + adaptive
  - expensive (especially when spilling is introduced)
– Selecting the most frequent stride
  + inexpensive, yet achieves most of the benefits
Computing Prefetching Distances

The prefetching distance D is the number of loop iterations to fetch ahead:

    D = (Latency(cur->next) × IPC) / (Length of the loop body)

    while (cur) {
        cur = cur->next;
        prefetch(cur + stride*D);
        /* other work to do */
    }

Also take the stride's run-length R into account:
– Observation: the first D instances of a run are not prefetched
  => If R ≤ D, we want a smaller prefetching distance (e.g., R/2)
Prefetch Minimization

Goal: remove prefetches of cache lines that have already been prefetched.

Example (before minimization):

    R1 <- R1 + 1024;
    prefetch 16(R1);
    prefetch 64(R1);
    if (...) then
        prefetch 32(R1); ...
    else
        prefetch 1024(R1); ...
    endif
    prefetch 72(R1);
    prefetch 118(R1); ...

After minimization:

    R1 <- R1 + 1024;
    prefetch 16(R1);     // beginning of a 3-line span
    prefetch 80(R1);     // 1 line from the beginning
    if (...) then
        ...              // prefetch 32(R1) combined into the span
    else
        prefetch 1024(R1); ...
    endif
    // prefetch 72(R1) combined into the span
    prefetch 118(R1); ...
Outline

Case Studies
Algorithm
Experimental Results
Conclusions
Experimental Framework

Machine: a DS20E Alpha workstation
– 667-MHz 21264 processor
– Memory hierarchy:
  • 64KB D-cache, 8MB L2-cache, 2GB memory
  • Miss latencies: 12+ cycles to L2, 80+ cycles to memory

Benchmarks: the entire SPEC2000 suite
– Profiled on "training" inputs, run on "reference" inputs

Baseline: Compaq C & Fortran compilers at "-O5"
– These implement classic array-based prefetching
Performance of Stride Prefetching

Substantial speedups:
– 3%-56% speedups in 11 benchmarks
– ≥10% speedups in 8 benchmarks

Only mild slowdowns:
– 1%-3% in 4 benchmarks

Normalized execution time (%), baseline = 100:

SPECFP:  ammp 99, applu 82, apsi 100, art 99, equake 64, facerec 89,
         fma3d 85, galgel 99, lucas 90, mesa 91, mgrid 100, sixtrack 98,
         swim 101, wupwise 97, AVG 92

SPECINT: bzip2 98, crafty 100, eon 101, gap 90, gcc 100, gzip 100,
         mcf 89, parser 96, perlbmk 102, twolf 103, vortex 99, vpr 96,
         AVG 98
Where Do the Gains Come From?

Wanted: load misses converted into prefetches
– Significant in gap, mcf, applu, equake, facerec, fma3d, lucas

Don't want too many extra misses
– A problem in twolf and facerec

Percentage of original L1-L2 traffic (O = original = 100, S = with stride
prefetching; the figure further breaks traffic into load misses, store
misses, and prefetches):

SPECINT: bzip2 110, crafty 101, gap 107, mcf 107, parser 102, perlbmk 100,
         twolf 132, vortex 108, vpr 102

SPECFP:  ammp 108, applu 101, apsi 101, art 101, equake 101, facerec 126,
         fma3d 106, galgel 102, lucas 100, mesa 101, mgrid 100, swim 100,
         wupwise 103
Overhead of Stride Profiling

Profiling time expressed as % of the original execution time
(naive instrumentation / optimized instrumentation; e.g., mcf's 545% and
438% correspond to the 5.5x and 4.4x slowdowns called out in the figure):

SPECINT: bzip2 3043/1015, crafty 2347/2551, eon 1663/941, gap 1696/1782,
         gcc 1600/1444, gzip 1447/1772, mcf 545/438, parser 1544/1495,
         perlbmk 1633/1729, twolf 1632/1440, vortex 2019/1770,
         vpr 1577/1568, AVG 1729/1495

SPECFP:  ammp 1207/599, applu 1435/335, apsi 1684/830, art 934/418,
         equake 996/756, facerec 1777/1272, fma3d 2894/960, galgel 3046/460,
         lucas 1248/535, mesa 1126/1075, mgrid 4098/317, sixtrack 1934/623,
         swim 1349/655, wupwise 1697/789, AVG 1816/687

Similar overhead to other software value profilers. Optimized
instrumentation is quite helpful:
– It reduces the overhead by two-thirds for SPECFP
Effectiveness of Sampled Profiling

Profiling time expressed as % of the original execution time
(complete profiling with optimized instrumentation / sampled profiling):

SPECINT: bzip2 1015/724, crafty 2551/1472, eon 941/619, gap 1782/1071,
         gcc 1444/992, gzip 1772/1145, mcf 438/292, parser 1495/927,
         perlbmk 1729/1103, twolf 1440/846, vortex 1770/1171, vpr 1568/1000,
         AVG 1495/947

SPECFP:  ammp 599/384, applu 335/257, apsi 830/550, art 418/328,
         equake 756/510, facerec 1272/889, fma3d 960/654, galgel 460/330,
         lucas 535/394, mesa 1075/875, mgrid 317/225, sixtrack 623/412,
         swim 655/516, wupwise 789/497, AVG 687/487

Sampling speeds up profiling by ~50%, with almost no degradation in the
resulting prefetching performance.
Conclusions

Broadens the scope of software prefetching:
– Uses profiling to detect statically unknown address strides
– Applies prefetching without source code or recompilation

Significant performance gains:
– 3%-56% speedups in 11 SPEC2K benchmarks on Alpha

Profiling overhead can be largely reduced:
– By instrumentation heuristics and sampling

Incorporated into Compaq Unix tools (Pixie and Spike).
Try it out yourself!
http://www.tru64unix.compaq.com/spike
Backups
Distribution of Strides

[Figure: for each SPECINT and SPECFP benchmark, the percentage of strided
loads covered by the most frequent stride vs. less frequent strides.]
Handling Multiple Strides

(a) Original:

    R2 <- load 16(R1)

(b) Choosing the most frequent stride:

    // k = most_frequent_stride * prefetching_distance + 16
    prefetch k(R1);
    R2 <- load 16(R1);

(c) Run-time stride detection:

    t <- R1 - t;          // t has been holding the previous value of R1,
                          // so (R1 - t) is the stride
    t <- prefetching_distance * t;
    t <- R1 + t;
    prefetch 16(t);
    t <- R1;              // save R1 for computing the next stride
    R2 <- load 16(R1);
DS20E's Memory Hierarchy

    Line Size                     64 bytes
    I-Cache                       64 KB, 2-way set-associative
    D-Cache                       64 KB, 2-way set-associative
    Off-Chip, Unified L2-Cache    8 MB, direct-mapped
    L1-to-L2 Miss Latency         12+ cycles
    L1-to-Memory Miss Latency     80+ cycles
    L1-to-L2 Bandwidth            6.9 GB/sec
    L2-to-Memory Bandwidth        2.6 GB/sec
    Memory Parallelism            32 in-flight loads, 32 in-flight stores,
                                  8 in-flight cache block fills, 8 cache victims
Number of Instrumentation Sites

(naive instrumentation / optimized instrumentation)

SPECINT: bzip2 6605/3273, crafty 12230/6115, eon 21227/9188, gap 25211/16413,
         gcc 65122/34963, gzip 6621/3353, mcf 7210/3373, parser 14003/7308,
         perlbmk 25420/16956, twolf 17721/10245, vortex 17408/8919,
         vpr 11146/6141

SPECFP:  ammp 10599/4052, applu 27087/12347, apsi 28537/14470, art 9089/4239,
         equake 7882/3717, facerec 26212/12914, fma3d 80869/32904,
         galgel 43423/16210, lucas 24434/12060, mesa 25150/11594,
         mgrid 23137/11675, sixtrack 60519/26459, swim 23055/11967,
         wupwise 22430/11511

Optimized instrumentation reduces the number of sites by half or more in
most cases.
Instrumentation Heuristics and Prefetch Minimization

Normalized execution time (%) under four configurations: (1) all loads
instrumented, (2) only non-scalar loads, (3) only non-scalar and
non-compiler-prefetched loads, (4) configuration 3 plus prefetch
minimization:

SPECINT: bzip2 98/97/100/98, crafty 100/100/100/100, eon 103/103/104/101,
         gap 90/90/90/90, gcc 102/102/101/100, gzip 100/100/99/100,
         mcf 90/92/90/89, parser 96/96/96/96, perlbmk 102/102/100/102,
         twolf 103/102/102/103, vortex 100/101/100/99, vpr 96/96/97/96,
         AVG 98/98/98/98

SPECFP:  ammp 100/101/99/99, applu 85/85/85/82, apsi 100/100/99/100,
         art 100/100/100/99, equake 64/64/64/64, facerec 91/89/89/89,
         fma3d 85/85/85/85, galgel 106/104/102/99, lucas 91/91/91/90,
         mesa 89/90/89/91, mgrid 102/102/100/100, sixtrack 96/96/96/98,
         swim 102/102/101/101, wupwise 97/97/96/97, AVG 93/93/93/92
Performance of Sampled Profiling

Almost no performance degradation from sampling. Normalized execution
time (%), complete profiling / sampled profiling:

SPECFP:  ammp 99/98, applu 82/83, apsi 100/100, art 99/99, equake 64/64,
         facerec 89/88, fma3d 85/85, galgel 99/100, lucas 90/90, mesa 91/89,
         mgrid 100/100, sixtrack 98/97, swim 101/101, wupwise 97/97,
         AVG 92/92

SPECINT: bzip2 98/98, crafty 100/100, eon 101/101, gap 90/90, gcc 100/99,
         gzip 100/100, mcf 89/92, parser 96/96, perlbmk 102/101,
         twolf 103/104, vortex 99/99, vpr 96/96, AVG 98/98
HW vs. SW Stride Prefetching

Hardware stride prefetching:
– Chen and Baer's scheme:
  • Computes strides at runtime using special hardware
  • Has a table that associates strides with loads

Advantages of HW stride prefetching:
– Requires no recompilation or post-link optimization
– No instruction overhead
– Adaptive

Advantages of SW stride prefetching:
– No limit on how many static loads can be remembered
– More programmable (e.g., in choosing prefetching distances)
– Applicable to most existing machines
Other Software Stride Prefetching

Compiler-inserted stride prefetching:
– Stoutchinin et al.'s approach:
  • Focuses on prefetching recurrent pointer updates
  • Does not use profiling
– Wu et al.'s approach:
  • Does stride profiling at the source level
  • Reduces profiling overhead by code transformation

Post-link stride prefetching:
– Barnes et al.'s approach:
  • Uses cache simulation to guide prefetch insertion