Link-Time Path-Sensitive Memory Redundancy Elimination
Manel Fernández and Roger Espasa
{mfernand,roger}@ac.upc.es
Computer Architecture Department
Universitat Politècnica de Catalunya
Barcelona, Spain
Motivation
The memory “gap”
  Processor speed increases faster than memory speed
  L1-cache latency continues to increase
  Memory operations remain a significant bottleneck
Memory redundancy
  Instructions that repeatedly access the same location
  Lots of memory operations are redundant
  Hardware designers exploit memory redundancy
    E.g., caches take advantage of temporal reuse
The compiler must be very aggressive in memory optimizations
Memory redundancy
Memory instructions that repeatedly access the same location
  Lots of memory operations are redundant
Sources of redundancy
  Source code structure
    Programmers introduce redundancy
  Traditional compilation
    Separate compilation units
    Limitations in the compilation model
    Code generation introduces redundancy
What percentage of memory operations are redundant at run time?

    … = *p;          redundancy source
    if ( … ) {
        *q = …;      intervening store
        … = *p;      redundant load
    }
Dynamic memory redundancy
[Figure: dynamic load/store redundancy (%), 0 to 100, against redundancy window size (2 to 1024 entries) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and their average. Series: load redundancy and store redundancy.]
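A measurement of this kind can be sketched as a replay of the dynamic load/store trace against a window of recently accessed locations. The sketch below is a hypothetical illustration, not the authors' tool. It assumes the window is managed as LRU over addresses, counts a load as redundant when it returns the value last seen at that address, and counts a store as redundant when it rewrites the value already recorded. The name `dynamic_redundancy` and the trace format are assumptions.

```python
from collections import OrderedDict


def dynamic_redundancy(trace, window):
    """Return (load_redundancy, store_redundancy) as fractions.

    trace  -- iterable of (op, address, value), op in {"load", "store"}
    window -- number of most recently accessed addresses remembered
              (assumption: LRU replacement)
    """
    recent = OrderedDict()  # address -> last value seen, oldest entry first
    loads = stores = red_loads = red_stores = 0
    for op, addr, value in trace:
        # Redundant: same location and same value as the last access in the window.
        hit = addr in recent and recent[addr] == value
        if op == "load":
            loads += 1
            red_loads += hit
        else:
            stores += 1
            red_stores += hit
        recent[addr] = value
        recent.move_to_end(addr)          # mark as most recently used
        if len(recent) > window:
            recent.popitem(last=False)    # evict the least recently used address
    return (red_loads / loads if loads else 0.0,
            red_stores / stores if stores else 0.0)
```

With a trace that mirrors the `*p` / `*q` fragment (load `p`, store `q`, load `p` again), the second load counts as redundant for a window of two entries but not for a window of one, which is why redundancy grows with window size.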
Talk outline
Motivation
Memory redundancy elimination (MRE)
Evaluation
Summary
Memory redundancy elimination (MRE)
Removal of memory instructions that repeatedly access the same location
Targeted at redundancy type
  Load redundancy elimination (LRE) in a path-sensitive fashion
    – Based on path-sensitive memory disambiguation
  Store redundancy elimination (SRE)
Targeted at redundancy distance
  Eliminating close/distant redundancy
In the context of a binary optimizer
  Overcome limitations of traditional compilers
  Need to deal with “executable code” problems
Load redundancy elimination (LRE)
Fundamental problems
  Alias analysis for disambiguation
  Liveness analysis for register bypassing
  Cost-benefit analysis for applying LRE
    Profile information is needed
Eliminating close redundancy
  Within extended basic blocks (EBBs)
Eliminating distant redundancy
  Intraprocedural dataflow analysis [HorspoolHo97]
  For fully/partially-redundant loads
    Redundancy on all/some paths
  Partial-LRE requires insertion of speculative loads
R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis, CSSE’97

Example (hot path):
        ...
    I1  load (p0), r1
        move r1, r0      (inserted)
        ...
    I2  load (p0), r2    (removed, replaced by: move r0, r2)
        ...
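The close-redundancy case can be sketched as a single forward pass over straight-line code. This is a minimal illustration under stated assumptions, not the Alto implementation. Instructions are hypothetical tuples written in the slides' operand order (`("load", base, dst)`, `("store", src, base)`, anything else with its destination register last). Without alias analysis, every store is assumed to clobber all remembered locations. The value is bypassed directly from the register that still holds it, rather than through a scratch register such as r0.

```python
def _kill(available, reg):
    """Forget every fact that depends on `reg`, which is being overwritten."""
    for base in [b for b, holder in available.items() if reg in (b, holder)]:
        del available[base]


def eliminate_redundant_loads(block):
    """Replace a load by a move when the value at (base) is still in a register."""
    available = {}  # base register -> register holding the value at (base)
    out = []
    for ins in block:
        op = ins[0]
        if op == "load":                      # ("load", base, dst)
            _, base, dst = ins
            holder = available.get(base)
            if holder == dst:
                continue                      # dst already holds the value
            out.append(ins if holder is None else ("move", holder, dst))
            _kill(available, dst)             # dst is redefined
            if dst != base:
                available.setdefault(base, dst)
        elif op == "store":                   # ("store", src, base)
            _, src, base = ins
            out.append(ins)
            available.clear()                 # assumption: may alias anything
            available[base] = src             # the stored value can be forwarded
        else:                                 # e.g. ("add", "p0", 8, "p0")
            out.append(ins)
            _kill(available, ins[-1])         # destination register is last
    return out
```

Redefining the base register (as `add p0,8,p0` does) or executing any store invalidates the remembered value, so the second load is kept in those cases.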
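Memory disambiguation
Register use-def chains
  Symbolic descriptors for every use
  Disambiguation by instruction inspection
Fails on path-sensitive redundancies
  Need to deal with path-sensitive information
  Partial-LRE is not sufficient either

Example:
    Entry block:             ...  I0 def p0  ...  I1 load (p0),r1  ...
    Block on one path only:  ...  I3 add p0,8,p0  ...
    Join block:              Iφ φ-def p0  ...  I2 load (p0),r2  ...

Symbolic descriptors of p0:
    $S_{I1}(p0) = p0_{I0}$
    $S_{I2}(p0) = \phi(p0_{I0}, p0_{I3}) = \phi(p0_{I0}, p0_{I0}+8)$
Comparing $S_{I1}(p0)$ with $S_{I2}(p0)$ by inspection cannot establish that I1 and I2 access the same location, although they do on the path that bypasses I3.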
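The idea of symbolic descriptors can be sketched as follows. This is an assumed encoding for illustration, not the representation used by the authors. Each register definition is resolved through its use-def chain into a set of `(root definition, offset)` alternatives. A φ-definition contributes the union of its operands, and two uses are proven to address the same location by inspection only when both descriptors are the same single alternative. The names `descriptor` and `must_alias` are hypothetical, and the use-def graph is assumed to be acyclic.

```python
def descriptor(defs, d):
    """Symbolic descriptor of definition `d` as a set of (root, offset) pairs.

    defs maps a definition name to one of:
      ("def",)                opaque root definition
      ("add", src, k)         src plus the constant k
      ("phi", src1, src2...)  merge of the values reaching a join point
    """
    kind, *args = defs[d]
    if kind == "def":
        return frozenset({(d, 0)})
    if kind == "add":
        src, k = args
        return frozenset((root, off + k) for root, off in descriptor(defs, src))
    if kind == "phi":
        return frozenset().union(*(descriptor(defs, s) for s in args))
    raise ValueError(f"unknown definition kind: {kind}")


def must_alias(a, b):
    """True only when both descriptors are the same single alternative."""
    return len(a) == 1 and a == b


# The example from the slide: I3 redefines p0 on one path only.
defs = {
    "I0": ("def",),
    "I3": ("add", "I0", 8),
    "Iphi": ("phi", "I0", "I3"),
}
s_i1 = descriptor(defs, "I0")     # address used by I1
s_i2 = descriptor(defs, "Iphi")   # address used by I2
```

Here `s_i2` holds two alternatives, so `must_alias(s_i1, s_i2)` is false even though one of the alternatives matches `s_i1` exactly. That is the redundancy plain inspection misses.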
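Path-sensitive memory disambiguation
  Established for only a subset of all the possible paths
  Subsumes generic disambiguation
Path-sensitive LRE
  Partial-LRE is now adapted for dealing with path-sensitive redundancies
  Availability on edge (AVEDGij)

Path-sensitive redundancy
    Entry block:             ...  I0 def p0  ...  I1 load (p0),r1  move r1,r0 (inserted)  ...
    Block on one path only:  ...  I3 add p0,8,p0  load (p0),r0 (inserted)  ...
    Join block:              Iφ φ-def p0  ...  I2 load (p0),r2 (removed, replaced by: move r0,r2)  ...

Symbolic descriptors of p0, per path:
    $S_{I1}(p0) = p0_{I0}$,  $S_{I2}(p0) = \phi(p0_{I0}, p0_{I0}+8)$
    Path $ps_1$:  $S^{ps_1}_{I1} = S^{ps_1}_{I2}$, since $p0_{I0} = p0_{I0}$  (√)
    Path $ps_2$:  $S^{ps_2}_{I1} \ne S^{ps_2}_{I2}$, since $p0_{I0} \ne p0_{I0}+8$  (x)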
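The per-path comparison can be sketched by looking at each incoming edge of the join block separately instead of merging the φ operands. The code below is an illustrative assumption about how availability on an edge could drive the transformation, not the published algorithm. Descriptors are sets of `(root, offset)` pairs, edges are `(from_block, to_block)` pairs, and the names `plan_path_sensitive_lre`, `B1`, `B2` and `B3` are hypothetical.

```python
def plan_path_sensitive_lre(source_descriptor, descriptor_on_edge):
    """Decide, per incoming edge, how the redundant load gets its value.

    source_descriptor  -- descriptor of the address used by the earlier load
    descriptor_on_edge -- {edge: descriptor of the address along that edge}
    Returns {edge: "bypass" | "insert load"}.
    """
    plan = {}
    for edge, desc in descriptor_on_edge.items():
        # The value is available on this edge only if the address is provably
        # the same single symbolic location as at the earlier load.
        available = len(desc) == 1 and desc == source_descriptor
        plan[edge] = "bypass" if available else "insert load"
    return plan


# Slide example: p0 is unchanged on the edge from B1, but is p0+8 on the edge from B2.
s_i1 = frozenset({("I0", 0)})
edges = {
    ("B1", "B3"): frozenset({("I0", 0)}),
    ("B2", "B3"): frozenset({("I0", 8)}),
}
plan = plan_path_sensitive_lre(s_i1, edges)
```

The resulting plan bypasses the register on the edge where the address is unchanged and inserts a load on the edge that passes through `add p0,8,p0`, matching the transformed code on the slide.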
Store redundancy elimination (SRE)
Similar approach to LRE
  SRE on EBBs
  Full- and Partial-SRE
    New formulation of the analysis
  No path-sensitive elimination!
Elimination of dead stores
  Other optimizations produce a lot of dead stores
  Form of dead code elimination
  Based on heuristics
    Includes a basic analysis for useless stack locations

Examples:
    ...  I1 store r1, (p0)  ...  I2 store r2, (p0)  ...
    ...  I1 load (p0), r0   ...  I2 store r0, (p0)  ...
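The first store example (two stores to the same location) can be sketched as a backward pass over straight-line code. This is a minimal illustration under stated assumptions, not the analysis from the talk. Instruction tuples follow the slides' operand order, every load is conservatively assumed to read any location, and only stores overwritten within the same block are removed. The function name is hypothetical.

```python
def eliminate_overwritten_stores(block):
    """Drop a store when a later store in the block overwrites (base) first."""
    overwritten = set()  # base registers stored to later on, with no load in between
    kept = []
    for ins in reversed(block):
        op = ins[0]
        if op == "store":                 # ("store", src, base)
            base = ins[2]
            if base in overwritten:
                continue                  # value is never observed: remove the store
            overwritten.add(base)
        elif op == "load":                # ("load", base, dst)
            overwritten.clear()           # assumption: the load may read any location
        else:                             # destination register is the last operand
            overwritten.discard(ins[-1])  # (base) before this point is another address
        kept.append(ins)
    kept.reverse()
    return kept
```

A usage note: the pass keeps the block unchanged whenever a load or a redefinition of the base register separates the two stores, which is the conservative choice when no alias information is available.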
Methodology
Benchmark suite
  SPECint95
    Compiled on an AlphaServer with full optimizations
    Instrumented using Pixie to get profiling information
    Aggressively re-optimized using Alto
Experimental framework
  Alto executable optimizer
Evaluation
  Dynamic number of loads/stores
  Actual execution time
    AlphaServer GS-140, Alpha EV6-21264
Dynamic number of loads/stores
[Figure: two bar charts, "Dynamic number of loads" and "Dynamic number of stores", on a 60% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]
Execution time
[Figure: bar chart of execution time on a 60% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]
Relative execution time on an AlphaServer GS-140, Alpha EV6-21264 525 MHz
Dynamic replay traps
[Figure: bar chart of dynamic Alpha 21264 replay traps on a 0% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]
Relative number of replay traps on the sim-alpha simulator, modeling an Alpha EV6-21264
Summary
A high percentage of memory operations are redundant
Memory redundancy elimination (MRE)
  Removal of redundant memory operations
    Load redundancy elimination (LRE) in a path-sensitive fashion
      – Based on path-sensitive memory disambiguation
    Store redundancy elimination (SRE)
      – Including elimination of dead stores
  For executable code or link-time
    Overcome limitations of traditional compilers
Valuable results on real execution time
Future directions
  Explore better alias analysis mechanism
  Additional techniques for MRE
Backup slides
Dynamic memory redundancy
[Figure: "Dynamic load redundancy (%)", 0 to 100, and "Dynamic store redundancy (%)", 0 to 40, each against redundancy window size (2 to 1024 entries) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and their average. The two charts are repeated at full size on the following slides, "Dynamic load redundancy" and "Dynamic store redundancy".]
Load redundancy elimination (LRE)
I1 loads a value from memory into r1
I2 loads from the same location into r2
Location (p0) is not modified between I1 and I2
r1 can be safely bypassed to r2
  I2 can be removed!

        ...
    I1  load (p0), r1
        move r1, r0      (inserted)
        ...
    I2  load (p0), r2    (removed, replaced by: move r0, r2)
        ...
LRE on executable code
Is (p1) at I1 the same memory location as (p2) at I2?
  Alias analysis!
Is there any available register between I1 and I2 that can be used to bypass r1 to r2?
  Register liveness analysis!

        ...
    I1  load (p1), r1
        move r1, r0      (inserted)
        ...
    I2  load (p2), r2    (removed, replaced by: move r0, r2)
        ...
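The register question can be sketched for straight-line code as a search for a register that nothing touches between I1 and I2 and whose old value is not needed afterwards. This is a simplified illustration, not the liveness analysis of a real binary optimizer. It assumes the set of registers live after I2 is already known, treats string operands as registers, and conservatively counts I2's own operands as busy. All names are hypothetical.

```python
def registers_used(ins):
    """Register operands of an instruction tuple (immediates are ints)."""
    return {x for x in ins[1:] if isinstance(x, str)}


def find_bypass_register(block, i1, i2, live_after_i2, all_regs):
    """Return a register that can carry the value of I1 down to I2, or None.

    block          -- list of instruction tuples
    i1, i2         -- indices of the two loads, with i1 < i2
    live_after_i2  -- registers whose current value is still needed after I2
    all_regs       -- every register the target provides
    """
    busy = set(live_after_i2)
    for ins in block[i1 + 1:i2 + 1]:      # everything after I1, up to and including I2
        busy |= registers_used(ins)
    for reg in sorted(all_regs):          # deterministic choice
        if reg not in busy:
            return reg
    return None
```

When the function returns `None`, no register is free over the whole range and the load cannot be bypassed this way, which is one reason the search becomes complex for distant redundancy.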
LRE: Eliminating close redundancy
For extended basic blocks (EBBs)
  Alias analysis: for disambiguation
  Register liveness analysis: for bypassing
Profile-guided LRE
  There is not always a benefit in removing a redundant load
  Need to evaluate cost-benefit of applying LRE!

    $B = lat_{load} \times freq(BB_2)$
    $C = lat_{move} \times (freq(BB_1) + freq(BB_2))$
    Apply LRE when $B > C$

Example (hot path):
        ...
    I1  load (p0), r1
        move r1, r0      (inserted)
        ...
    I2  load (p0), r2    (removed, replaced by: move r0, r2)
        ...
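The cost-benefit test can be sketched as a small function over profile counts. The formula used here is an assumption about the slide's equation. The benefit is the load latency times the execution frequency of the block that loses its load. The cost is the move latency times the frequencies of the two blocks that each receive a move. The function name and the latency values in the usage note are hypothetical.

```python
def lre_gain(freq_bb1, freq_bb2, lat_load, lat_move):
    """Estimated cycles saved by removing the load in BB2 (negative means a loss).

    freq_bb1 -- execution count of the block holding the first load (gets a move)
    freq_bb2 -- execution count of the block holding the redundant load
    """
    benefit = lat_load * freq_bb2                 # loads no longer executed
    cost = lat_move * (freq_bb1 + freq_bb2)       # one inserted move in each block
    return benefit - cost
```

For example, with a 3-cycle load and a 1-cycle move, two equally hot blocks give a positive gain, while a redundant load in a cold block reached from a hot block gives a negative one, so LRE would not be applied there.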
LRE: Eliminating distant redundancy
For eliminating fully- and partially-redundant loads
  Requires insertion of speculative loads
Dataflow analysis [HorspoolHo97]
  Extended cost equation
  Complex search for available registers

    $C = C_{bypass} + C_{insert}$
    $C_{bypass} = lat_{move} \times \left( freq(BB_{red}) + \sum_{i=1}^{n} freq(BB_{src_i}) \right)$
    $C_{insert} = lat_{load} \times \sum_{i=1}^{m} freq(EDG_{src_i})$

Example:
    One predecessor:    ...  I1 store r1,(p0)  move r1,r0 (inserted)
    Other predecessor:  ...  load (p0),r0 (inserted, speculative)
    Join block:         I2 load (p0),r1 (removed, replaced by: move r0,r1)  ...

R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis, CSSE’97
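The extended cost equation can be sketched in the same style as the close-redundancy test. The decomposition below is an assumed reading of the slide, not a transcription of the published formula. The bypass cost covers one move in the block of the removed load and one in each source block. The insert cost covers a speculative load on each edge that lacks the value. The benefit is the latency of the removed load times its execution frequency. All parameter names are hypothetical.

```python
def partial_lre_gain(freq_red, src_block_freqs, insert_edge_freqs, lat_load, lat_move):
    """Estimated cycles saved by partial LRE (negative means a loss).

    freq_red          -- execution count of the block with the redundant load
    src_block_freqs   -- execution counts of blocks that already produce the value
    insert_edge_freqs -- execution counts of edges that receive a speculative load
    """
    c_bypass = lat_move * (freq_red + sum(src_block_freqs))
    c_insert = lat_load * sum(insert_edge_freqs)
    benefit = lat_load * freq_red
    return benefit - (c_bypass + c_insert)
```

The more often the inserted loads execute relative to the removed one, the smaller the gain, so the transformation pays off mainly when the path that already has the value is the hot one.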
Path-sensitive LRE
Path-sensitive redundancy
  Redundancy occurs only on some execution paths
  Partial-LRE is not sufficient
Memory disambiguation
  Using register use-def chains
  Symbolic descriptors for every use
Path-sensitive memory disambiguation is needed!

Example:
    Entry block:             ...  I0 def p0  ...  I1 load (p0),r1  ...
    Block on one path only:  ...  I3 add p0,8,p0  ...
    Join block:              Iφ φ-def p0  ...  I2 load (p0),r2  ...

Symbolic descriptors of p0:
    $S_{I1}(p0) = p0_{I0}$
    $S_{I2}(p0) = \phi(p0_{I0}, p0_{I3}) = \phi(p0_{I0}, p0_{I0}+8)$
    $S_{I1}(p0) \ne S_{I2}(p0)$
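Path-sensitive memory disambiguation
Path-sensitive information
  Disambiguation is established for only a subset of all the possible paths
  For detecting path-sensitive exact memory dependencies
Partial-LRE
  Algorithm is now adapted for dealing with path-sensitive redundancies
  Availability on edge (AVEDGij)

Example (after LRE):
    Entry block:             ...  I0 def p0  ...  I1 load (p0),r1  move r1,r0 (inserted)  ...
    Block on one path only:  ...  I3 add p0,8,p0  load (p0),r0 (inserted)  ...
    Join block:              Iφ φ-def p0  ...  I2 load (p0),r2 (removed, replaced by: move r0,r2)  ...

Symbolic descriptors of p0, per path:
    $S_{I1}(p0) = p0_{I0}$,  $S_{I2}(p0) = \phi(p0_{I0}, p0_{I0}+8)$
    Path $ps_1$:  $S^{ps_1}_{I1} = S^{ps_1}_{I2}$, since $p0_{I0} = p0_{I0}$  (√)
    Path $ps_2$:  $S^{ps_2}_{I1} \ne S^{ps_2}_{I2}$, since $p0_{I0} \ne p0_{I0}+8$  (x)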
A combined algorithm
Short-distance MRE
  Basic: MRE within EBBs
Long-distance MRE
  Full: Full-MRE
  Partial: Partial-MRE
  Complete: Path-sensitive LRE, Partial SRE, Dead store elimination
[Diagram: optimizer phases shown as boxes: Easy optimizations (including Basic-MRE); Function inlining; Long-distance MRE (Full/Partial/Complete); Easy optimizations (including Basic-MRE); Easy optimizations (including Basic-MRE).]
Dynamic number of loads
[Figure: bar chart of the dynamic number of loads on a 60% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]
Dynamic number of stores
[Figure: bar chart of the dynamic number of stores on a 60% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]
Alpha 21264 results
[Figure: two bar charts, "Execution time" on a 60% to 100% axis and "Dynamic number of replay traps" on a 0% to 100% axis, per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean). Series: Basic, Full, Partial, Complete.]