Stream-based Memory Specialization for General Purpose Processors
Zhengrong Wang
Tony Nowatzki
Computation & Memory Specialization
ISA abstractions for certain computation patterns.
New ISA abstraction for memory access patterns?
[Figure: computation structures (SIMD, compound instruction) next to memory structures (a stream b[i] feeding an indirect stream a[b[i]]).]
Conventional Memory Abstraction
[Figure: timeline of the O3 core, L1 cache and L2 cache. Each iteration (addr, load, if, add, br) sends an address to the caches and waits for the value through the miss/hit responses.]
while (i < N) {
  if (cond)
    v += a[i];
  i++;
}
Opportunity 1: Prefetch with Ctrl. Flow
[Figure: timeline of the O3 core, L1 cache and L2 cache. The stream engine (SE) is configured before the loop and prefetches, so later loads hit.]
Overhead 1: Hard to prefetch with control flow.
Opportunity 1: Prefetch with control flow.
cfg(a[i]);
while (i < N) {
  if (cond)
    v += a[i];
  i++;
}
Opportunity 2: Semi-Binding Prefetch
[Figure: timeline of the O3 core, L1 cache and L2 cache. The stream engine (SE) is configured before the loop and prefetches into a FIFO; the addr/load instructions in the core are the redundant work.]
Opportunity 1: Prefetch with control flow.
Overhead 2: Redundant address computation.
Opportunity 2: Semi-binding prefetch.
s_a = cfg();
while (i < N) {
  if (cond)
    v += s_a;
  i++;
}
Opportunity 3: Stream-Aware Policies
[Figure: timeline of the O3 core, L1 cache and L2 cache. The stream engine (SE) is configured before the loop and prefetches into a FIFO.]
Opportunity 1: Prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Opportunity 2: Semi-binding prefetch.
Overhead 3: Assumptions on reuse cause extra accesses/conflicts.
Opportunity 3: Better policies, e.g. bypass a cache level if there is no locality.
s_a = cfg();
while (i < N) {
  if (cond)
    v += s_a;
  i++;
}
Stream: A New ISA Memory Abstraction
• Potential Benefits of Stream Abstractions
– Energy-efficient prefetching.
– Offload memory accesses from the expensive OOO pipeline.
– Leverage stream information in cache policies.
• This Work
– Analyze programs to determine the requirements of the ISA.
– Develop a hardware-neutral ISA extension: decoupled-stream ISA.
– Develop micro-architecture extensions for an OOO processor.
• Evaluation
– 60% of memory accesses → streams.
– 1.37x speedup over a traditional O3 processor.
Outline
• Stream Characteristics.
• Stream ISA Extension.
• Stream-Aware Policies.
• Microarchitecture Extension.
• Evaluation.
Stream Definition
• Stream -- Decoupled portion of a program:
– Accesses memory at most once.
– Has no internal control flow.
– Has no dependences from the main program to the stream besides configuration and termination.
Main Program:
for (int i = 0; i < M; ++i) {
  for (int j = 0; j < N; ++j) {
    … = a[i][j];
  }
}
Stream: a[i][j]
Stream Characteristics – Stream Type
Trace analysis on CortexSuite/SPEC CPU 2017.
• 51.49% affine, 10.19% indirect.
• Indirect streams can be as high as 40%.
[Figure: breakdown of accesses (0% to 100%) into Affine, Indirect, PC, Unqualified and Outside.]
Support indirect streams.
Stream Characteristics – Stream Length
• 51% of stream accesses come from streams longer than 1k.
• Some benchmarks contain short streams.
[Figure: stream accesses (0% to 100%) by stream length (>1k, >100, >50, >0) for pca, rbm, disparity, lbm_s, sphinx, srr, svm, xz_s and avg.]
Support longer streams to capture long-term behavior.
Low overhead to support short streams.
Stream Characteristics – Control Flow
• 53% of stream accesses come from loops with control flow.
[Figure: stream accesses (0% to 100%) by number of execution paths within the loop (>3, 3, 2, 1).]
Decouple from control flow.
Outline
• Stream Characteristics.
• Stream ISA Extension.
• Stream-Aware Policies.
• Microarchitecture Extension.
• Evaluation.
Decoupled Stream ISA Components
• Setup/Teardown
– stream_config(s_i) -- sets up stream parameters
– stream_end(s_i) -- deallocates streams
• Pseudo Register
– Register mapped to stream data
– Allows an offset to access different elements of the stream
• Supporting Control Flow
– Decouple stream advance from use of the pseudo-register
– step(s_i) -- updates the position of a stream by one iteration
Stream ISA Extension – Basic Example
Original C Code:
int i = 0;
while (i < N) {
  sum += a[i];
  i++;
}
Stream Decoupled Pseudo Code:
stream_cfg(s_i, s_a);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a);
Stream Dependence Graph: the induction variable stream s_i (i++) feeds the memory stream s_a (a[i]).
[Figure: the memory stream a[i] covers Memory 0x400, 0x404, 0x408, … at iterations 0, 1, 2. The user reads the current element through the pseudo-register s_a, and each step (i++) advances the stream.]
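To make the decoupled pseudo code concrete, the sketch below emulates it in plain C. The `affine_stream` struct and the helpers `stream_cfg`, `stream_value`, `stream_step` and `stream_done` are hypothetical software stand-ins, not the ISA encoding from the talk; in hardware, reading the pseudo-register and stepping would each be a single instruction served by the stream engine.

```c
#include <stddef.h>

/* Hypothetical software model of the stream pair (s_i, s_a). */
typedef struct {
    const int *base;  /* start of the array a[]            */
    size_t     iter;  /* induction variable stream s_i     */
    size_t     len;   /* trip count N                      */
} affine_stream;

/* Models stream_cfg: record the access pattern before the loop. */
affine_stream stream_cfg(const int *base, size_t len) {
    affine_stream s = { base, 0, len };
    return s;
}

/* Models a read of the pseudo-register s_a: the current element a[s_i]. */
int stream_value(const affine_stream *s) { return s->base[s->iter]; }

/* Models stream_step(s_i): advance the stream by one iteration. */
void stream_step(affine_stream *s) { s->iter++; }

/* Models the loop condition s_i < N (returns nonzero when finished). */
int stream_done(const affine_stream *s) { return s->iter >= s->len; }

/* Stream-decoupled form of: while (i < N) { sum += a[i]; i++; } */
int sum_with_stream(const int *a, size_t n) {
    affine_stream s = stream_cfg(a, n);
    int sum = 0;
    while (!stream_done(&s)) {
        sum += stream_value(&s);
        stream_step(&s);
    }
    return sum;  /* stream_end(s_i, s_a) would release the stream here */
}
```

Note that the loop body no longer computes an address: the only work left in the core is the add and the step.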
Stream ISA Extension – Control Flow
Original C Code:
int i = 0, j = 0;
while (cond) {
  if (a[i] < b[j])
    i++;
  else
    j++;
}
Stream Decoupled Pseudo Code:
stream_cfg(s_i, s_a, s_j, s_b);
while (cond) {
  if (s_a < s_b)
    stream_step(s_i);
  else
    stream_step(s_j);
}
stream_end(s_i, s_a, s_j, s_b);
Stream Dependence Graph: s_i feeds s_a, and s_j feeds s_b.
[Figure: the memory stream a[i] covers Memory 0x400, 0x404, 0x408, … at iterations 0, 1, 2. The user reads the current element through the pseudo-register s_a, and each step (i++) advances the stream.]
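The control-flow example can be emulated the same way. In the sketch below, the `stream` struct and its helpers are hypothetical, and the unspecified `cond` from the slide is assumed to mean "both streams still have elements". The point it illustrates is that each stream advances only when the program explicitly steps it, so a data-dependent branch decides which stream moves.

```c
#include <stddef.h>

/* Hypothetical software model of one induction/memory stream pair. */
typedef struct {
    const int *base;
    size_t     iter;
    size_t     len;
} stream;

int  stream_value(const stream *s) { return s->base[s->iter]; }
void stream_step(stream *s)        { s->iter++; }

/* Walks a[] and b[] as two independently stepped streams and reports
 * how many times each one was stepped. */
void walk_two_streams(const int *a, size_t na, const int *b, size_t nb,
                      size_t *steps_a, size_t *steps_b) {
    stream s_a = { a, 0, na };
    stream s_b = { b, 0, nb };
    /* Assumed `cond`: neither stream is exhausted. */
    while (s_a.iter < s_a.len && s_b.iter < s_b.len) {
        if (stream_value(&s_a) < stream_value(&s_b))
            stream_step(&s_a);   /* only a[] advances */
        else
            stream_step(&s_b);   /* only b[] advances */
    }
    *steps_a = s_a.iter;
    *steps_b = s_b.iter;
}
```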
Stream ISA Extension – Indirect Stream
Original C Code:
int i = 0;
while (i < N) {
  sum += a[b[i]];
  i++;
}
Stream Decoupled Pseudo Code:
stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);
Stream Dependence Graph: s_i feeds s_b (b[i]), which feeds s_a (a[b[i]]).
[Figure: the stream b[i] covers Memory 0x400, 0x404, 0x408, … at iterations 0, 1, 2, and the indirect stream a[b[i]] covers Memory 0x888, 0x668, 0x86c, …. The user reads the current elements through the pseudo-registers s_b and s_a, and each step (i++) advances both.]
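A software sketch of the indirect case follows. The `indirect_stream` struct and the accessor names are hypothetical; the sketch only shows the dependence chain from the slide, where the index stream s_b supplies the operand for the data stream s_a, and a single step advances both.

```c
#include <stddef.h>

/* Hypothetical software model of the chained streams s_i -> s_b -> s_a. */
typedef struct {
    const int    *a;     /* data array, accessed as a[b[i]] */
    const size_t *b;     /* index array, accessed as b[i]   */
    size_t        iter;  /* s_i                             */
    size_t        len;   /* trip count N                    */
} indirect_stream;

/* Models a read of s_b: the current index b[s_i]. */
size_t index_value(const indirect_stream *s) { return s->b[s->iter]; }

/* Models a read of s_a: the element selected by the current index. */
int data_value(const indirect_stream *s) { return s->a[index_value(s)]; }

/* Stepping s_i advances the index stream and, through it, the data stream. */
void stream_step(indirect_stream *s) { s->iter++; }

/* Stream-decoupled form of: while (i < N) { sum += a[b[i]]; i++; } */
int sum_indirect(const int *a, const size_t *b, size_t n) {
    indirect_stream s = { a, b, 0, n };
    int sum = 0;
    while (s.iter < s.len) {
        sum += data_value(&s);
        stream_step(&s);
    }
    return sum;
}
```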
Implications of Decoupled-Stream Extensions
• New architectural states:
– Stream configuration.
– Current iteration’s data.
• Maintain the memory order.
– Load → first use of the pseudo-register after the stream is configured/stepped.
• New speculation in ISA:
– Stream elements will be used.
– Streams are not too short (overfetch).
• “Vectorize” accesses that were not vectorizable before
– From the perspective of L1-Cache->Core
– Regular parts of irregular program regions
Outline
• Stream Characteristics.
• Stream ISA Extension.
• Stream-Aware Policies.
• Microarchitecture Extension.
• Evaluation.
Microarchitecture
[Figure: a stream over Memory 0x400, 0x404, 0x408, 0x40c, 0x410, with the pseudo-register mapped to the current element.]
Microarchitecture – Misspeculation
• Control misspeculated stream_step.
– Decrement the iteration map.
– No need to flush the FIFO and re-fetch data (decoupled)!
• Other misspeculation.
– Revert the stream states, including stream FIFO.
• Memory fault delayed until the use of the element.
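The cheap recovery path for a misspeculated stream_step can be illustrated with a small software model. Everything here is an assumption for illustration: the `stream_state` layout, the fixed `FIFO_SIZE`, and the helper names are hypothetical, and the model assumes reads stay within the prefetched window. It shows that a revert only decrements the speculative iteration, while the prefetched data stays in the FIFO until a step commits.

```c
#include <stddef.h>

#define FIFO_SIZE 8  /* hypothetical FIFO capacity */

typedef struct {
    int    fifo[FIFO_SIZE]; /* prefetched elements, slot = iteration % FIFO_SIZE */
    size_t head;            /* oldest iteration still held (freed at commit)     */
    size_t tail;            /* next iteration to prefetch                        */
    size_t spec_iter;       /* iteration map entry: speculative current iter.    */
} stream_state;

/* Prefetch from `mem` until the FIFO is full or the stream ends. */
void stream_fill(stream_state *s, const int *mem, size_t len) {
    while (s->tail < len && s->tail - s->head < FIFO_SIZE) {
        s->fifo[s->tail % FIFO_SIZE] = mem[s->tail];
        s->tail++;
    }
}

/* Pseudo-register read: element of the speculative current iteration. */
int stream_read(const stream_state *s) {
    return s->fifo[s->spec_iter % FIFO_SIZE];
}

/* Dispatch of a stream_step: bump the iteration map. */
void stream_step(stream_state *s) { s->spec_iter++; }

/* Squash of a control-misspeculated stream_step: undo the bump only.
 * The FIFO contents are untouched, so nothing is re-fetched. */
void stream_revert_step(stream_state *s) { s->spec_iter--; }

/* Commit of a stream_step: the oldest element can now be released. */
void stream_commit_step(stream_state *s) { s->head++; }
```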
Outline
• Stream Characteristics.
• Stream ISA Extension.
• Microarchitecture Extension.
• Stream-Aware Policies.
• Evaluation.
Methodology
• Compiler in LLVM:
– Identify stream candidates.
– Generate stream configuration.
– Transform the program.
• Gem5 + McPAT simulation.
• 33 Benchmarks:
– SPEC2017 C/CPP benchmarks.
– CortexSuite.
• SimPoint:
– 10-million-instruction simpoints.
– 10 simpoints per benchmark.
Results – Overall Performance
[Figure: overall performance (axis 0 to 7) for five configurations.]
• Pf-Stride: traditional stride prefetcher.
• SSP-Non-Bind: stream is just a prefetch.
• SSP-Semi-Bind: stream instructions are decoupled.
• SSP-Cache-Aware: stream-based cache bypassing.
• Pf-Helper: ideal prefetch 1K instructions ahead.
Results – Design Space Interaction
[Figure: two panels, CortexSuite and SPEC CPU 2017, plotting energy (0.5 to 1.1) against speedup (1 to 3) for OOO[2,6,8], Pf-Stride[2,6,8], Pf-Helper[2,6,8] and SSP-Cache-Aware[2,6,8].]
Conclusion
• Stream as a new memory abstraction in ISA.
– ISA/Microarchitecture extension.
– Stream-aware cache bypassing.
• New paradigm of memory specialization.
– New direction for improving cache architectures.
– Combine memory and computation specialization.
Stream-Aware Policies
Compiler (ISA)/Hardware
• Rich Information: Memory Footprint, Reuse Distance, Modified?, Conditionally Used?, Indirect, …
• Better Policies: Prefetch Throttling, Cache Replacement, Cache Bypassing, Sub-Line Transfer, …
Backup Slides
Results – Semi-Binding Prefetching
[Figure: remaining vs. added instructions (0 to 1), and speedup of semi-binding prefetch vs. non-binding prefetch (1 to 1.5).]
Related Work
• Decoupled access/execute.
– Outrider [ISCA’11], DeSC [MICRO’15], etc.
– Ours: New ISA abstraction for the access engine.
• Prefetching.
– Stride, IMP [MICRO’15], etc.
– Ours: Explicit access pattern in ISA.
• Cache bypassing policy.
– Counter-based [ICCD’05], LLC bypassing [ISCA’11], etc.
– Ours: Incorporate static stream information.
Stream-Aware Policies – Cache Bypass
• Stream: Access Pattern → Precise Memory Footprint.
while (i < N)
  while (j < N)
    while (k < N)
      sum += a[k][i] * b[k][j];
[Figure: streams s_a and s_b over a[N][N] and b[N][N], shown against the Core, L1$ and L2$ with each stream's reuse distance.]
Results – Unused Stream Requests
[Figure: unused stream requests (axis 0 to 0.7) for lda, liblinear, motion-estimation, pca, rbm, sphinx, srr, svd3, disparity, localization, mser, multi_ncut, sift, stitch, svm, texture_synthesis, tracking, namd_r, parest_r, povray_r, blender_r, perlbench_s, gcc_s, mcf_s, lbm_s, omnetpp_s, xalancbmk_s, x264_s, deepsjeng_s, imagick_s, leela_s, nab_s, xz_s and avg.]
Stream ISA Extension – Compiler in LLVM
Pipeline: Identify Stream Candidates → Select Stream Candidates → Code Generation
• Identify stream candidates.
– Search backwards on operands of candidates.
– Detect patterns and dependences between candidates.
• Select stream candidates.
– Check decouplable pattern.
– Limited number of pseudo-registers.
• Code generation.
– Stream configuration, e.g. type, parameters, etc.
– Transform the program.
Stream: A New ISA Memory Abstraction
Higher level of abstraction embeds richer semantic info.
Opens up new opportunities.
1.37x speedup over a hardware stride prefetcher.
[Figure: Core and Mem exchange accesses as (Address, Value) pairs: (0x400, 1), (0x404, 2), (0x408, 3), …; a stream in the ISA describes the whole sequence.]
Questions:
• What stream patterns to support?
• How to deal with control flow?
• ISA/microarchitecture co-design?
• …
Opportunities:
• Prefetch with control flow.
• Decouple address computation.
• Better cache policies.
• …
Insight: Stream as a new ISA abstraction
• Higher level abstraction than single memory requests.
– Access pattern, memory footprint, etc.
– More accurate & efficient than hardware approaches, e.g. hardware prefetching.
• Rich information brings new opportunities:
– More accurate & timely prefetching.
– Removing redundant work, e.g. address computation.
– Better cache policies.
Stream ISA Extension – Basic Principles
• Stream: A decoupled sequence of values/addresses.
• Usage of stream data:
– Referred to by a pseudo-register.
– Explicitly stepped by a stream_step inst.
[Figure: architectural and physical registers next to a stream over Memory 0x400, 0x404, 0x408, 0x40c, 0x410; the user reads the current element through the pseudo-register, and a step advances it.]
Stream ISA Extension – Nested Loop
• Indirect Stream.
• Decoupled from control flow.
• Longer streams: Configure stream in outer loop.
• Coalesce stream.
Original C Code:
int i = 0;
while (i < M) {
  int j = 0;
  while (j < N) {
    sum += a[i][j];
    j++;
  }
  i++;
}
Stream Decoupled Pseudo Code:
int i = 0;
stream_cfg(s_j, s_a);
while (i < M) {
  while (s_j < N) {
    sum += s_a;
    stream_step(s_j);
  }
  stream_step(s_j);
  i++;
}
stream_end(s_j, s_a);
Stream Dependence Graph: s_j feeds s_a.
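One way to read the nested-loop pseudo code is that the stream is configured once, outside both loops, and the extra stream_step after the inner loop wraps s_j to the start of the next row. The sketch below emulates that reading in plain C; the `nested_stream` struct and the wrap-around rule in `stream_step` are assumptions made for illustration, since the slide does not spell out the semantics.

```c
#include <stddef.h>

/* Hypothetical model of one stream spanning the whole row-major a[M][N]. */
typedef struct {
    const int *base;  /* first element of a[M][N]       */
    size_t     n;     /* inner trip count N             */
    size_t     row;   /* current outer position         */
    size_t     j;     /* s_j: inner induction variable  */
} nested_stream;

/* Pseudo-register s_a: the current element a[row][j]. */
int stream_value(const nested_stream *s) {
    return s->base[s->row * s->n + s->j];
}

/* Assumed semantics: a step at the end of a row wraps to the next row. */
void stream_step(nested_stream *s) {
    if (s->j == s->n) {
        s->j = 0;
        s->row++;
    } else {
        s->j++;
    }
}

int sum_2d(const int *a, size_t m, size_t n) {
    nested_stream s = { a, n, 0, 0 };  /* stream_cfg, once for all rows */
    int sum = 0;
    for (size_t i = 0; i < m; i++) {
        while (s.j < n) {
            sum += stream_value(&s);
            stream_step(&s);
        }
        stream_step(&s);               /* wrap s_j for the next row */
    }
    return sum;                        /* stream_end would follow here */
}
```

Configuring once for all M rows is what makes the stream long: the configuration cost is paid a single time instead of once per outer iteration.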
Stream ISA Extension – Coalesce Stream
• Indirect Stream.
• Decoupled from control flow.
• Longer streams (nested loop).
• Coalesce stream.
Original C Code:
int i = 0;
while (i < N) {
  b[i] = a[i].x + a[i].y;
  i++;
}
Stream Decoupled Pseudo Code:
stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  s_b = s_a.x + s_a.y;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);
Stream Dependence Graph: s_i feeds s_a and s_b.
Microarchitecture – Core View
Goal: What is the current iteration of a stream?
• At dispatch, a stream user instruction looks up the iteration map.
• At dispatch, a stream_step instruction increments the iteration map.
[Figure, animated over five slides: a sequence of stream user (s) and stream_step instructions is dispatched one by one; each user is tagged with the current iteration and the Stream 1 entry of the iteration map advances from 0 to 3.]
Microarchitecture – Revert stream_step
• Control misspeculated stream_step.
– Decrement the iteration map.
– No need to flush the FIFO and re-fetch data (decoupled)!
[Figure: a stream over Memory 0x400, 0x404, 0x408, 0x40c, 0x410; the pseudo-register moves back by one element when the step is reverted.]
Stream ISA Extension – Stepping
• Explicit stepping with the stream_step instruction.
[Figure: a stream over Memory 0x400, 0x404, 0x408, 0x40c, 0x410; each step moves the pseudo-register to the next element.]
Microarchitecture – Stream View
Managed by the Stream Engine.
• Definition: address pattern parameters (initial address, stride, etc.).
• State: current iteration.
• Operands: index from cache, for indirect streams.
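For an affine stream, the definition and state listed above are enough to generate every address. The sketch below assumes the simplest one-dimensional pattern, address = initial address + stride × iteration; the `affine_pattern` struct and `stream_address` helper are hypothetical names, and real configurations may carry more parameters.

```c
#include <stdint.h>

/* Hypothetical per-stream record kept by the stream engine. */
typedef struct {
    uint64_t base;    /* definition: initial address         */
    int64_t  stride;  /* definition: byte stride per element */
    uint64_t iter;    /* state: current iteration            */
} affine_pattern;

/* Address of the element at the current iteration. */
uint64_t stream_address(const affine_pattern *p) {
    return p->base + (uint64_t)(p->stride * (int64_t)p->iter);
}
```

With an initial address of 0x400 and a stride of 4 bytes, iterations 0, 1 and 2 map to 0x400, 0x404 and 0x408, the addresses used in the earlier examples.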
Microarchitecture – Aliasing Streams
• Leverage the LSQ to handle aliasing.
– Extend the LSQ with PEB to record prefetching stream elements.
– Insert into the LSQ for the user instruction that semantically triggers the stream access.
– Revert and restart at memory order violation.
Stream-Aware Policies – Throttling
Goal: Prefetch just in time.
Evenly distributing FIFO entries among streams leads to untimely prefetching.
• A “Late” counter per stream.
– Incremented every time a user instruction is blocked by the stream.
– Allocate more FIFO entries to the stream with a high “Late” count.
– After some iterations, stream 1 has enough entries to hide the latency.
[Figure, animated over four slides: allocated and unallocated FIFO entries (ready or not ready) for Stream 1 and Stream 2 across iterations (stream elements), with each stream's Late counter. Stream 1's counter rises to 1 while it is late and returns to 0 once it has enough entries.]
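The throttling idea can be sketched as a small allocation rule. The slides do not give the exact policy, so the details below are assumptions: the `LATE_THRESHOLD` constant, the shared pool of free entries, and resetting the counter after a grant are all hypothetical choices made to keep the example short.

```c
#define LATE_THRESHOLD 2  /* hypothetical tuning constant */

typedef struct {
    unsigned entries;  /* FIFO entries currently allocated to the stream */
    unsigned late;     /* the per-stream "Late" counter                  */
} stream_throttle;

/* Called when a user instruction finds its stream element not ready.
 * Bumps the Late counter and, once it reaches the threshold, grants the
 * stream one more FIFO entry from the shared pool (if any is free). */
void on_user_blocked(stream_throttle *s, unsigned *free_entries) {
    s->late++;
    if (s->late >= LATE_THRESHOLD && *free_entries > 0) {
        s->entries++;       /* let the stream run further ahead */
        (*free_entries)--;
        s->late = 0;        /* start counting again             */
    }
}
```

A stream that is never late keeps its initial allocation, so the spare entries flow to the streams whose latency is not yet hidden.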
Stream-Aware Policies – Cache Bypass
• Stream: Access Pattern → Precise Memory Footprint.
– Precisely estimate the memory footprint when configuring streams.
– Bypass streams with low reuse.
• Without bypassing:
– Overhead 1: Caching low-reuse data.
– Overhead 2: Unused L2 MSHRs.
• With L1 bypassed:
– Benefit 1: Do not evict data in L1.
– Benefit 2: Increase memory parallelism.
[Figure, animated over two slides: Stream 1's FIFO entries (fetching or not fetching) across iterations, with MSHR occupancy at L1 and L2 above memory. Without bypassing, L1 tracks elements 0 and 1 and L2 tracks 0 to 3; with L1 bypassed, elements 0 to 3 are in flight.]
Stream-Aware Policies – Cache Bypass
Goal: Correctly identify streams with low reuse.
• A stream table to combine static & dynamic information.
• Tags are augmented with the stream id.
• An FSM to bypass streams with:
✓ High miss count &&
✓ Low reuse count &&
✓ (Large footprint || No footprint info)
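The bypass condition above can be written as a single predicate over a stream table entry. This is only a sketch of the decision, not the FSM itself: the `stream_table_entry` fields, the two threshold constants, and the choice to call a footprint "large" when it exceeds the cache capacity are all assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MISS_THRESHOLD  16u  /* hypothetical: what counts as a high miss count */
#define REUSE_THRESHOLD 4u   /* hypothetical: what counts as a low reuse count */

/* Hypothetical stream table entry mixing static and dynamic information. */
typedef struct {
    unsigned misses;           /* dynamic: misses seen for this stream        */
    unsigned reuses;           /* dynamic: reuses seen for this stream        */
    bool     has_footprint;    /* static: did configuration give a footprint? */
    uint64_t footprint_bytes;  /* static: estimated footprint, if known       */
} stream_table_entry;

/* High miss count && low reuse count && (large footprint || no footprint info). */
bool should_bypass(const stream_table_entry *e, uint64_t cache_bytes) {
    bool high_miss = e->misses >= MISS_THRESHOLD;
    bool low_reuse = e->reuses < REUSE_THRESHOLD;
    bool large_or_unknown =
        !e->has_footprint || e->footprint_bytes > cache_bytes;
    return high_miss && low_reuse && large_or_unknown;
}
```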
Results – Stream-Aware Cache Bypassing
[Figure: SSP-Cache-Aware vs. SSP-Semi-Bind (axis 1 to 1.5, with one off-scale value of 3.40). Labels: Non-Binding Prefetch, Semi-Binding Prefetch, Stream-Aware Cache.]