1
Instruction Based Memory Distance Analysis and its Application to Optimization
Changpeng Fang, Steve Carr, Soner Önder, Zhenlin Wang
2
Motivation
Widening gap between processor and memory speed
• memory wall
Static compiler analysis has limited capability
• regular array references only
• index arrays
• integer code
Reuse distance
• prediction across program inputs
• number of distinct memory locations accessed between two references to the same memory location
• applicable to more than just regular scientific code
• locality as a function of data size
• predictable on whole program and per instruction basis for scientific codes
3
Motivation
Memory distance
• A dynamic, quantifiable distance, in terms of memory references, between two accesses to the same memory location.
• reuse distance
• access distance
• value distance
Is memory distance predictable across both integer and floating-point codes?
• predict miss rates
• predict critical instructions
• identify instructions for load speculation
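To make the reuse distance definition concrete (the number of distinct memory locations accessed between two references to the same location), here is a minimal Python sketch over a toy address trace. The function name `reuse_distances` and the trace format are assumptions made for illustration. It is a simple quadratic-time reference version, not the instrumentation used for the experiments.

```python
def reuse_distances(trace):
    """Return the backward reuse distance of every access in the trace.

    The first access to an address has no earlier reference, so it gets None.
    """
    last_seen = {}   # address -> index of its most recent access
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses touched strictly between the two accesses.
            between = set(trace[last_seen[addr] + 1:i])
            result.append(len(between))
        else:
            result.append(None)
        last_seen[addr] = i
    return result


if __name__ == "__main__":
    print(reuse_distances(["a", "b", "c", "b", "a"]))
```

In the trace `a b c b a`, the second access to `b` has distance 1 (only `c` lies in between) and the second access to `a` has distance 2 (`b` and `c`).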
4
Related Work
Reuse distance
• Mattson, et al. ’70
• Sugumar and Abraham ’94
• Beyls and D’Hollander ’02
• Ding and Zhong ’03
• Zhong, Dropsho and Ding ’03
• Shen, Zhong and Ding ’04
• Fang, Carr, Önder and Wang ’04
• Marin and Mellor-Crummey ’04
Load speculation
• Moshovos and Sohi ’98
• Chrysos and Emer ’98
• Önder and Gupta ’02
5
Background
Memory distance
• can use any granularity (cache line, address, etc.)
• either forward or backward
• represented as a pattern
Represent memory distance as a pattern
• divide consecutive ranges into intervals
• we use powers of 2 up to 1K and then 1K intervals
Data size
• the largest reuse distance for an input set
• characterize reuse distance as a function of the data size
• Given two sets of patterns for two runs, can we predict a third set of patterns given its data size?
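The interval scheme (powers of 2 up to 1K, then fixed 1K-wide intervals) can be sketched as a small binning function. The exact boundaries used here, half-open intervals `[lo, hi)` with a separate `[0, 1)` bin, and the names `interval_of` and `pattern_histogram` are assumptions for illustration; the slides do not spell them out.

```python
from collections import Counter


def interval_of(distance):
    """Return the half-open interval [lo, hi) that contains a distance.

    Assumed layout: [0,1), [1,2), [2,4), ..., [512,1024), then 1K-wide bins.
    """
    if distance < 1:
        return (0, 1)
    if distance < 1024:
        lo = 1 << (distance.bit_length() - 1)   # largest power of 2 <= distance
        return (lo, 2 * lo)
    lo = (distance // 1024) * 1024
    return (lo, lo + 1024)


def pattern_histogram(distances):
    """Count how many distances fall into each interval."""
    return Counter(interval_of(d) for d in distances)


if __name__ == "__main__":
    print(pattern_histogram([0, 5, 6, 3000]))
```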
6
Background
Let $d_i^1$ be the distance of the ith bin in the first pattern and $d_i^2$ be that of the second pattern. Given the data sizes $s_1$ and $s_2$, we can fit the memory distances using
$$d_i^1 = c_i + e_i \cdot f_i(s_1)$$
$$d_i^2 = c_i + e_i \cdot f_i(s_2)$$
Given $c_i$, $e_i$, and $f_i$, we can predict the memory distance of another input set with its data size
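As a worked sketch of the fit, assume the form d = c + e * f(s) for each bin and that the shape function f is already chosen (for example linear or square root, both of which appear among the reported patterns). Two training runs then give two equations in the two unknowns c and e, which can be solved directly. The helper names `fit_pattern` and `predict_distance` are hypothetical, and how f is selected is not covered here.

```python
import math


def fit_pattern(d1, s1, d2, s2, f):
    """Solve d1 = c + e*f(s1) and d2 = c + e*f(s2) for (c, e)."""
    denom = f(s1) - f(s2)
    if denom == 0:
        raise ValueError("f must distinguish the two data sizes")
    e = (d1 - d2) / denom
    c = d1 - e * f(s1)
    return c, e


def predict_distance(c, e, f, s):
    """Predict the memory distance for a new data size s."""
    return c + e * f(s)


if __name__ == "__main__":
    linear = lambda s: s
    c, e = fit_pattern(100, 1000, 200, 2000, linear)
    print(predict_distance(c, e, linear, 4000))

    c, e = fit_pattern(10, 100, 20, 400, math.sqrt)
    print(predict_distance(c, e, math.sqrt, 1600))
```

A constant pattern falls out of the same code: when both training distances are equal, e is 0 and every prediction equals c.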
7
Instruction Based Memory Distance Analysis
How can we represent the memory distance of an instruction?
For each active interval, we record 4 words of data
• min, max, mean, frequency
Some locality patterns cross interval boundaries
• merge adjacent intervals, i and i + 1, if $\mathit{min}_{i+1} - \mathit{max}_i \le \mathit{max}_i - \mathit{min}_i$
• merging process stops when a minimum frequency is found
• needed to make reuse distance predictable
The set of merged intervals makes up the memory distance patterns
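The adjacency test for merging (merge intervals i and i + 1 when the gap between them, min of i+1 minus max of i, is no larger than the width of interval i) can be sketched as follows. This is an illustration under stated assumptions: the `Interval` record and `merge_adjacent` are hypothetical names, the merged mean is taken as the frequency-weighted mean, each comparison uses the running merged interval, and the minimum-frequency stopping rule is left out because the slide does not specify it.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    min: int      # smallest distance observed in the interval
    max: int      # largest distance observed in the interval
    mean: float   # mean distance
    freq: int     # number of observations (assumed positive)


def merge_adjacent(intervals):
    """Merge neighbours whose gap is at most the width of the left interval.

    `intervals` must be sorted by increasing distance.
    """
    merged = []
    for cur in intervals:
        if merged:
            prev = merged[-1]
            if cur.min - prev.max <= prev.max - prev.min:
                total = prev.freq + cur.freq
                mean = (prev.mean * prev.freq + cur.mean * cur.freq) / total
                merged[-1] = Interval(prev.min, cur.max, mean, total)
                continue
        merged.append(cur)
    return merged


if __name__ == "__main__":
    raw = [Interval(10, 15, 12.0, 4),
           Interval(17, 30, 20.0, 4),
           Interval(500, 510, 505.0, 2)]
    for pattern in merge_adjacent(raw):
        print(pattern)
```

Here the first two intervals merge (gap 2, width 5) while the third stays separate (gap 470, width 20), giving two patterns.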
8
Merging Example
9
What do we do with patterns?
Verify that we can predict patterns given two training runs
• coverage
• accuracy
Predict miss rates for instructions
Predict loads that may be speculated
10
Prediction Coverage
Prediction coverage indicates the percentage of instructions whose memory distance can be predicted
• appears in both training runs
• access pattern appears in both runs and memory distance does not decrease with increase in data size (spatial locality)
  • same number of intervals in both runs
• Called a regular pattern
For each instruction, we predict its ith pattern by
• curve fitting the ith pattern of both training runs
• applying the fitting function to construct a new min, max and mean for the third run
Simple, fast prediction
11
Prediction Accuracy
An instruction’s memory distance is correctly predicted if all of its patterns are predicted correctly
• predicted and observed patterns fall in same interval
• or, given two patterns A and B such that $B.\mathit{min} \le A.\mathit{max} \le B.\mathit{max}$,
$$\frac{A.\mathit{max} - \max(A.\mathit{min},\, B.\mathit{min})}{\max(B.\mathit{max} - B.\mathit{min},\; A.\mathit{max} - A.\mathit{min})} \ge 0.9$$
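The overlap criterion can be checked with a few lines of Python: for patterns A and B with B.min <= A.max <= B.max, the overlap length A.max - max(A.min, B.min) divided by the wider of the two patterns must reach 0.9. This sketch covers only that overlap test, not the same-interval test. Patterns are plain `(min, max)` tuples, the function names are hypothetical, and trying both orderings of the pair is an assumption.

```python
def overlap_ratio(a, b):
    """Overlap of a and b relative to the wider pattern.

    Returns None when the precondition b.min <= a.max <= b.max fails.
    """
    a_min, a_max = a
    b_min, b_max = b
    if not (b_min <= a_max <= b_max):
        return None
    widest = max(b_max - b_min, a_max - a_min)
    if widest == 0:
        return 1.0   # both patterns are the same single point
    return (a_max - max(a_min, b_min)) / widest


def predicted_correctly(predicted, observed, threshold=0.9):
    """Apply the overlap test, trying both orderings of the pair."""
    for a, b in ((predicted, observed), (observed, predicted)):
        ratio = overlap_ratio(a, b)
        if ratio is not None:
            return ratio >= threshold
    return False


if __name__ == "__main__":
    print(predicted_correctly((100, 200), (105, 205)))  # ratio 0.95
    print(predicted_correctly((100, 200), (150, 250)))  # ratio 0.50
```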
12
Experimental Methodology
Use 11 CFP2000 and 11 CINT2000 benchmarks
• others don’t compile correctly
Use ATOM to collect reuse distance statistics
Use test and train data sets for training runs
Evaluation based on dynamic weighting
Report reuse distance prediction accuracy
• value and access very similar
13
Reuse Distance Prediction
           Patterns
Suite      %constant  %linear  Coverage%  Accuracy%
CFP2000    85.1       7.7      93.0       97.6
CINT2000   81.2       5.1      91.6       93.8
14
Coverage issues
Reasons for no coverage
1. instruction does not appear in at least one test run
2. reuse distance of test is larger than train
3. number of patterns does not remain constant in both training runs
Suite Reason 1 Reason 2 Reason 3
CFP2000 4.2% 0.3% 2.5%
CINT2000 2.2% 4.4% 1.8%
15
Prediction Details
Other patterns
• 183.equake has 13.6% square root patterns
• 200.sixtrack, 186.crafty all constant (no data size change)
Low coverage
• 189.lucas – 31% of static memory operations do not appear in training runs
• 164.gzip – the test reuse distance greater than train reuse distance
  • cache-line alignment
16
Number of Patterns
Suite 1 2 3 4 5
CFP2000 81.8% 10.5% 4.8% 1.4% 1.5%
CINT2000 72.3% 10.9% 7.6% 4.6% 5.3%
17
Miss Rate Prediction
Predict a miss for a reference if the backward reuse distance is greater than the cache size.
• neglects conflict misses
Accurate miss rate prediction
$$1 - \frac{|\mathit{actual} - \mathit{predicted}|}{\max(\mathit{actual},\, \mathit{predicted})}$$
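Both steps on this slide are simple enough to sketch: a reference is predicted to miss when its backward reuse distance exceeds the cache size, and the accuracy of a predicted miss rate is 1 - |actual - predicted| / max(actual, predicted). The function names are hypothetical, the cache size is assumed to be expressed in the same granularity as the distances (for example cache lines), and treating first-time accesses (distance `None`) as misses is an assumption.

```python
def predicted_miss_rate(distances, cache_size):
    """Fraction of references whose reuse distance exceeds the cache size.

    First-time accesses (None) are counted as misses in this sketch.
    Conflict misses are ignored, as in a fully associative model.
    """
    if not distances:
        return 0.0
    misses = sum(1 for d in distances if d is None or d > cache_size)
    return misses / len(distances)


def prediction_accuracy(actual, predicted):
    """1 - |actual - predicted| / max(actual, predicted) for miss rates."""
    if actual == predicted:
        return 1.0   # also covers the case where both rates are zero
    return 1.0 - abs(actual - predicted) / max(actual, predicted)


if __name__ == "__main__":
    rate = predicted_miss_rate([None, 1, 5, 10], cache_size=4)
    print(rate)
    print(prediction_accuracy(actual=0.10, predicted=0.08))
```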
18
Miss Rate Prediction Methodology
Three miss-rate prediction schemes
TCS – test cache simulation
• Use the actual miss rates from running the program on the test data as the miss rates for the reference data
RRD – reference reuse distance
• Use the actual reuse distance of the reference data set to predict the miss rate for the reference data set
• An upper bound on using reuse distance
PRD – predicted reuse distance
• Use the predicted reuse distance for the reference data set to predict the miss rate.
19
Cache Configurations
config no.  L1                 L2
1           32K, fully assoc.  1M, fully assoc.
2           32K, 2-way         1M, 8-way
3           32K, 2-way         1M, 4-way
4           32K, 2-way         1M, 2-way
20
L1 Miss Rate Prediction Accuracy
Suite PRD RRD TCS
CFP2000 97.5 98.4 95.1
CINT2000 94.4 96.7 93.9
21
L2 Miss Rate Accuracy
           2-way              Fully Associative
Suite      PRD   RRD   TCS    PRD   RRD    TCS
CFP2000    91%   93%   87%    97%   99.9%  91%
CINT2000   91%   95%   87%    94%   99.9%  89%
22
Critical Instructions
Given reuse distance for an instruction
• Can we determine which instructions are critical in terms of cache performance?
An instruction is critical if it is in the set of instructions that generate the most L2 cache misses
• Those top miss-rate instructions whose cumulative total misses account for 95% of the misses in a program.
• Use the execution frequency of one training run to determine each instruction's relative contribution to the number of misses
Compare the actual critical instructions with predicted
• Use cache configuration 2
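Selecting the critical set can be sketched as a cumulative cut-off: rank instructions by miss count and keep adding them until the running total reaches 95% of all misses. The function name and the instruction labels are hypothetical. Ranking by miss count is an assumption; a per-instruction count could be estimated as miss rate times execution frequency.

```python
def critical_instructions(misses_by_instruction, coverage=0.95):
    """Return the top instructions that together reach the miss coverage.

    `misses_by_instruction` maps an instruction label to its miss count.
    """
    total = sum(misses_by_instruction.values())
    if total == 0:
        return []
    ranked = sorted(misses_by_instruction.items(),
                    key=lambda item: item[1], reverse=True)
    critical, running = [], 0
    for label, misses in ranked:
        if running >= coverage * total:
            break   # coverage target already reached
        critical.append(label)
        running += misses
    return critical


if __name__ == "__main__":
    counts = {"ld1": 900, "ld2": 60, "ld3": 30, "ld4": 10}
    print(critical_instructions(counts))
```

With these counts, `ld1` and `ld2` together account for 960 of 1000 misses, so the remaining two loads are not critical.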
23
Critical Instruction Prediction
Suite PRD RRD TCS %pred %act
CFP2000 92% 98% 51% 1.66% 1.67%
CINT2000 89% 98% 53% 0.94% 0.97%
24
Critical Instruction Patterns
Suite 1 2 3 4 5
CFP2000 22.1 38.4 20.0 12.8 6.7
CINT2000 18.7 14.5 25.5 22.5 18
25
Miss Rate Discussion
PRD performs better than TCS when data size is a factor
TCS performs better when data size doesn’t change much and there are conflict misses
PRD is much better at identifying the critical instructions than TCS
• these instructions should be targets of optimization
26
Memory Disambiguation
Load speculation
• Can a load safely be issued prior to a preceding store?
• Use a memory distance to predict the likelihood that a store to the same address has not finished
Access distance
• The number of memory operations between a store to and load from the same address
• Correlated to instruction distance and window size
• Use only two runs
  • If access distance is not constant, use the access distance of the larger of the two data sets as a lower bound on access distance
27
When to Speculate
Definitely “no”
• access distance less than threshold
Definitely “yes”
• access distance greater than threshold
Threshold lies between intervals
• compute predicted mis-speculation frequency (PMSF)
  • speculate if PMSF < 5%
• When threshold does not intersect intervals
  • total of frequencies that lie below the threshold
• Otherwise
$$\frac{\mathit{threshold} - \mathit{min}}{\mathit{max} - \mathit{min}} \times \mathit{frequency}$$
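A minimal sketch of the predicted mis-speculation frequency (PMSF) follows. Intervals are `(min, max, frequency)` tuples with frequency given as a fraction of the load's executions. Intervals entirely below the threshold contribute their full frequency, and an interval cut by the threshold contributes (threshold - min) / (max - min) * frequency. Adding that partial share to the intervals below it is an assumed reading of the slide, and the function names are hypothetical.

```python
def pmsf(intervals, threshold):
    """Predicted mis-speculation frequency for one load."""
    total = 0.0
    for lo, hi, freq in intervals:
        if hi < threshold:
            total += freq                      # whole interval is too close
        elif lo < threshold:
            # Threshold falls inside this interval: take a linear share.
            total += (threshold - lo) / (hi - lo) * freq
    return total


def should_speculate(intervals, threshold, limit=0.05):
    """Speculate only when PMSF stays below the 5% limit."""
    return pmsf(intervals, threshold) < limit


if __name__ == "__main__":
    pattern = [(1, 4, 0.02), (10, 30, 0.10), (100, 200, 0.88)]
    print(pmsf(pattern, threshold=5), should_speculate(pattern, 5))
    print(pmsf(pattern, threshold=20), should_speculate(pattern, 20))
```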
28
Value-based Prediction
Memory dependence only if addresses and values match
store a1, v1
store a2, v2
store a3, v3
load a4, v4
Can move ahead if a1=a2=a3=a4, v2=v3 and v1≠v2
The access distance of a load to the first store in a sequence of stores storing the same value is called the value distance
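The difference between access distance and value distance can be illustrated on a small trace of memory operations. In this sketch, access distance counts the memory operations between a load and the most recent store to its address. Value distance is measured back to the first store in the run of same-address stores that wrote the same value. The trace format and the name `load_distances` are assumptions, as is skipping operations on other addresses while scanning backward.

```python
def load_distances(trace, load_index):
    """Return (access_distance, value_distance) for the load at load_index.

    `trace` is a list of (op, address, value) tuples with op "st" or "ld".
    Returns (None, None) when no earlier store to the address exists.
    """
    _, addr, _ = trace[load_index]
    access = value = None
    stored_value = None
    for j in range(load_index - 1, -1, -1):
        op, a, v = trace[j]
        if op != "st" or a != addr:
            continue                       # only stores to this address matter
        between = load_index - j - 1       # memory operations in between
        if access is None:
            access = value = between       # most recent store to the address
            stored_value = v
        elif v == stored_value:
            value = between                # earlier store wrote the same value
        else:
            break                          # value differs: the run ends here
    return access, value


if __name__ == "__main__":
    trace = [("st", 0xA, 1), ("st", 0xA, 2), ("ld", 0xB, 9),
             ("st", 0xA, 2), ("ld", 0xA, 2)]
    print(load_distances(trace, 4))
```

In this trace the last load directly follows a store to the same address (access distance 0), but that store rewrites the value already written three operations earlier, so the value distance is 2.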
29
Experimental Design
SPEC CPU2000 programs
SPEC CFP2000
• 171.swim, 172.mgrid, 173.applu, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi
SPEC CINT2000
• 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 300.twolf
Compile with gcc 2.7.2 -O3
Comparison
• Access distance, value distance
• Store set with 16KB table, also with values
• Perfect disambiguation
30
Micro-architecture
issue width 8
fetch width 8
retire width 16
window size 128
load/store queue 128
functional units 8
fetch multiblock gshare
data cache perfect
memory ports 2
Operation Latency
load 2
int division 8
int multiply 4
other int 1
float multiply 4
float addition 3
float division 8
other float 2
31
IPC and Mis-speculation
Suite      Access Distance  Store Set 16KB Table  Perfect
CFP2000    3.21             3.37                  3.71
CINT2000   2.90             3.22                  3.35
           Mis-speculation Rate   % Speculated Loads
Suite      Access   Store Set     Access   Store Set
CFP2000    2.36     0.07          57.2     62.0
CINT2000   2.33     0.08          26.9     34.7
32
Value-based Disambiguation
Suite      Value Distance  Store Set 16KB Value
CFP2000    3.34            3.55
CINT2000   3.00            3.23
Suite      Mis-speculation Rate  % Speculated Loads
CFP2000    1.22                  59.3
CINT2000   1.55                  27.6
33
Cache Model
Suite Access Store Set 16K
CFP2000 1.55 1.61
CINT2000 1.53 1.60
Suite Value Store Set 16K
CFP2000 1.59 1.63
CINT2000 1.55 1.65
34
Summary
Over 90% of memory operations can have their reuse distance predicted, with 97% and 93% accuracy for floating-point and integer programs, respectively
We can accurately predict miss rates for floating-point and integer codes
We can identify 92% of the instructions that cause 95% of the L2 misses
Access- and value-distance-based memory disambiguation is competitive with the best hardware techniques, without a hardware table
35
Future Work
Develop a prefetching mechanism that uses the identified critical loads.
Develop an MLP system that uses critical loads and access distance.
Path-sensitive memory distance analysis
Apply memory distance to working-set based cache optimizations
Apply access distance to EPIC style architectures for memory disambiguation.