Stream-based Memory Specialization for General Purpose ... · – Energy-efficient prefetching. – Offload Memory from expensive OOO Pipeline – Leverage stream information in cache

1

Stream-based Memory Specialization for General Purpose Processors

Zhengrong Wang

Tony Nowatzki

2

Computation & Memory Specialization

ISA abstraction for certain computation pattern.

New ISA abstraction for memory access pattern?

+-+-+-+-

SIMD

+

+/

Compound Instruction

a[b[0]]

a[b[1]]

a[b[2]]

…

b[0]

b[1]

b[2]

…

b[i]

a[b[i]]

Stream

ComputationStructure

MemoryStructure

3

Conventional Memory Abstraction

addr

load

if

add

br

addr

load

if

add

br

O3 Core

Miss

L1 Cache L2 Cache

Hit

Resp.Resp.

Miss Hit

Resp.Resp.

Addr.

Addr.

Val.

Val.

while (i < N) {if (cond)

v += a[i];i++;

}

4

Opportunity 1: Prefetch with Ctrl. Flow

addr

load

if

addr

load

if

O3 Core

Miss

L1 Cache L2 Cache

Hit

Resp.

Miss Hit

Resp.

Addr.

Addr.

add

br add

br

Resp.

Resp.Val.

Overhead 1:Hard to prefetch with control flow.

Before loop.

cfg. SE. Miss Hit

Resp.Resp.

Prefetch.

Hit

Hit

Val.

Opportunity 1:Prefetch with control flow.

cfg(a[i]);while (i < N) {if (cond)

v += a[i];i++;

}

5

Opportunity 2: Semi-Binding Prefetch

if

if

O3 Core L1 Cache L2 Cache

add

br add

br

Before loop.

cfg. SE. Miss Hit

Resp.Resp.

Prefetch.

Hit

addr

load addr

load

Addr.

Addr.Resp.

Resp.Val.

HitOpportunity 1:Prefetch with control flow.

Overhead 2:Redundant address computation.

FIFOOpportunity 2:Semi-binding prefetch.

s_a = cfg();while (i < N) {if (cond)

v += s_a;i++;

}

6

Opportunity 3: Stream-Aware Policies

if

if

O3 Core L1 Cache L2 Cache

add

br add

br

Before loop.

cfg. SE. Hit

Resp.

Prefetch.

Opportunity 1:Prefetch with control flow.

Overhead 2:Repeated address computation/loads.

FIFO

Miss

Resp.

Opportunity 2:Semi-binding prefetch.

Overhead 3:Assumption on reuse causes extra access/conflicts

Opportunity 3:Better policies, e.g. bypass a cache level if no locality.

s_a = cfg();while (i < N) {if (cond)

v += s_a;}

7

Stream: A New ISA Memory Abstraction

• Potential Benefits of Stream Abstractions– Energy-efficient prefetching.

– Offload Memory from expensive OOO Pipeline

– Leverage stream information in cache policies.

• This Work– Analyze programs to determine the requirements of the ISA

– Develop a hardware-neutral ISA extension: decoupled-stream ISA

– Develop micro-architecture extensions for an OOO processor

• Evaluation– 60% memory accesses → streams.

– 1.37 speedup over a traditional O3 processor.

8

Outline• Stream Characteristics.

• Stream ISA Extension.

• Stream-Aware Policies.

• Microarchitecture Extension.

• Evaluation.

9





• Evaluation.

10

Stream Definition

• Stream -- Decoupled portion of a program:– Accesses memory at most one times

– No internal control flow

– Has no dependences from the main program to the stream besides configuration and termination

a[i][j]

Main Program

for (int i = 0; i < M; ++i) {for (int j = 0; j < N; ++j) {… = a[i][j];

Stream

11

Stream Characteristics – Stream TypeTrace analysis on CortexSuite/SPEC CPU 2017.

• 51.49% affine, 10.19% indirect.

• Indirect streams can be as high as 40%.

0%10%20%30%40%50%60%70%80%90%

100%

Affine Indirect PC Unqualified Outside

Support indirect stream.

12

Stream Characteristics – Stream Length• 51% stream accesses from stream longer than 1k.

• Some benchmarks contain short streams.

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

pca rbm disparity lbm_s sphinx srr svm xz_s avg.

>1k >100 >50 >0

Support longer stream to capture long term behavior.Low overhead to support short streams.

13

Stream Characteristics – Control Flow• 53% stream accesses from loop with control flow.

0%10%20%30%40%50%60%70%80%90%

100%

Execution Paths within the Loop >3 3 2 1

Decouple from control flow.

15





• Evaluation.

16

Decoupled Stream ISA Components

• Setup/Teardown– stream_config(s_i) -- setup stream parameters

– stream_end(s_i) -- deallocates streams

• Pseudo Register– Register mapped to stream data

– Allows offset to access different elements of stream

• Supporting Control Flow– Decoupled stream advance from use of pseudo-register

– step(s_i) – update the position of a stream by one iter.

17

Stream ISA Extension – Basic Example

Original C Code Stream Decoupled Pseudo Code Stream Dependence Graph

int i = 0;while (i < N) {sum += a[i];i++;

}

stream_cfg(s_i, s_a);while (s_i < N) {sum += s_a;stream_step(s_i);

}stream_end(s_i, s_a);

s_i

s_a

Memory 0x408

…

Memory 0x400

Memory 0x404

Stream a[i]

s_ai++

UserStep.

i++

Pseudo-RegIter.

0

1

2

Induction Var. Stream

MemoryStream

18

Stream ISA Extension – Control FlowOriginal C Code Stream Decoupled Pseudo Code Stream Dependence Graph

int i = 0, j = 0;while (cond) {if (a[i] < b[j])

i++;else

j++;}

stream_cfg(s_i, s_a, s_j, s_b);while (cond) {

if (s_a < s_b)stream_step(s_i);

elsestream_step(s_j);

}stream_end(s_i, s_a, s_j, s_b);

s_i

s_a

s_j

s_b

Memory 0x408

…

Memory 0x400

Memory 0x404

Stream a[i]

s_a

i++

UserStep

i++

Pseudo-RegIter.

0

1

2

19

Stream ISA Extension – Indirect StreamOriginal C Code Stream Decoupled Pseudo Code Stream Dependence Graph

int i = 0;while (i < N) {sum += a[b[i]];i++;

}

stream_cfg(s_i, s_a, s_b);while (s_i < N) {

sum += s_a;stream_step(s_i);

}stream_end(s_i, s_a, s_b);

s_i

s_b

s_a

Memory 0x86c

…

Memory 0x888

Memory 0x668

a[b[i]]

i++

UserStep

i++

Pseudo-Reg

Memory 0x408

…

Memory 0x400

Memory 0x404

b[i]

s_a s_b

Pseudo-RegIter.012

20

Implications of Decoupled-Stream Extensions

• New architectural states:– Stream configuration.

– Current iteration’s data

• Maintain the memory order.– Load → first use of the pseudo-register after configured/stepped.

• New speculation in ISA:– Stream elements will be used.

– Streams are not too short (overfetch).

• “Vectorize” access that was not vectorizable before– From the perspective of L1-Cache->Core

– Regular parts of irregular program regions

21





• Evaluation.

22

Microarchitecture

Pseudo-Reg Memory 0x408

Memory 0x40c

Memory 0x400

Memory 0x404

Memory 0x410

Stream

23

Microarchitecture – Misspeculation• Control misspeculated stream_step.

– Decrement the iteration map.

– No need to flush the FIFO and re-fetch data (decoupled) !

• Other misspeculation.– Revert the stream states, including stream FIFO.

• Memory fault delayed until the use of the element.

24





• Evaluation.

25

Methodology

• Compiler in LLVM:– Identify stream candidates.

– Generate stream configuration.

– Transform the program.

• Gem5 + McPAT simulation.

• 33 Benchmarks:– SPEC2017 C/CPP benchmarks.

– CortexSuite.

• SimPoint:– 10 million instructions’ simpoints.

10 simpoints per benchmark.

27

Results – Overall Performance

0

1

2

3

4

5

6

7

Pf-Stride SSP-Non-Bind SSP-Semi-Bind SSP-Cache-Aware Pf-Helper

TraditionalStride

Prefetcher

Stream is just a

prefetch

Stream instructions

re decoupled

Stream-based Cache Bypassing

Ideal Prefetch 1K instructions

ahead

28

Results – Design Space Interaction

0.5

0.6

0.7

0.8

0.9

1

1.1

1 1.5 2 2.5 3

Ener

gy

CortexSuite Speedup

OOO[2,6,8] Pf-Stride[2,6,8]

0.5

0.6

0.7

0.8

0.9

1

1.1

1 1.5 2 2.5 3

Ener

gy

SPEC CPU 2017 Speedup

Pf-Helper[2,6,8] SSP-Cache-Aware[2,6,8]

29

Conclusion

• Stream as a new memory abstraction in ISA.– ISA/Microarchitecture extension.

– Stream-aware cache bypassing.

• New paradigm of memory specialization.– New direction for improving cache architectures.

– Combine memory and computation specialization.

30

Stream-Aware Policies

Compiler (ISA)/Hardware

Memory Footprint

Reuse Distance

Modified?

Conditional Used?

Indirect

…

Prefetch Throttling

Cache Replacement

Cache Bypassing

Sub-Line Transfer

…

Rich Information Better Policies

31

Backup Slides

32

Results – Semi-Binding Prefetching

00.20.40.60.8

1

Remain Insts Added Insts

1

1.5Speedup of Semi-Binding Prefetch vs. Non-Binding Prefetch

33

Related Work• Decouple access execute.

– Outrider [ISCA’11], DeSC [MICRO’15], etc.

– Ours: New ISA abstraction for the access engine.

• Prefetching.– Stride, IMP [MICRO’15], etc.

– Ours: Explicit access pattern in ISA.

• Cache bypassing policy.– Counter-based [ICCD’05], LLC bypassing [ISCA’11], etc.

– Ours: Incorporate static stream information.

34

Stream-Aware Policies – Cache Bypass• Stream: Access Pattern → Precise Memory Footprint.

while (i < N) while (j < N)

while (k < N) sum += a[k][i]

* b[k][j];

Core

L1$

L2$

s_as_b

s_b s_a

a[N][N] b[N][N]

Reuse Dist. Reuse Dist.

35

Results – Unused Stream Requests

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

lda

liblin

ear

mo

tio

n-e

stim

atio

np

carb

msp

hin

xsr

rsv

d3

dis

par

ity

loca

lizat

ion

mse

rm

ult

i_n

cut

sift

stit

chsv

mte

xtu

re_

syn

thes

istr

acki

ng

nam

d_r

par

est_

rp

ovr

ay_r

ble

nd

er_r

per

lben

ch_s

gcc_

sm

cf_s

lbm

_so

mn

etp

p_s

xala

ncb

mk_

sx2

64_

sd

eep

sjen

g_s

imag

ick_

sle

ela_

sn

ab_s

xz_s

avg.

36

Stream ISA Extension – Compiler in LLVM• Identify stream candidates.– Search backwards on operands of candidates.

–Detect pattern and dependency between candidates.

Identify Stream

Candidates

SelectStream

Candidates

Code Generation

• Select stream candidates.– Check decouplable pattern.

– Limited number of pseudo-registers.

• Code generation.– Stream configuration, e.g. type, parameters, etc.

– Transform the program.

37

Stream: A New ISA Memory AbstractionHigher level of abstraction embeds richer semantic info.

Opens up new opportunities.

1.37x speedup over a hardware stride prefetch.

Core

Mem

Acc.(Address, Value):(0x400, 1)(0x404, 2)(0x408, 3)…

Streamin ISA

• What stream patterns to support?• How to deal with control flow?• ISA microarchitecture co-design?• …

• Prefetch with control flow.• Decouple address computation.• Better cache policies.• …

38

Insight: Stream as a new ISA abstraction• Higher level abstraction than single memory requests.

– Access pattern, memory footprint, etc.

– More accurate & efficient than hardware approaches, e.g. hardware prefetching.

• Rich information brings new opportunities:– More accurate & timely prefetching.

– Removing redundant work, e.g. address computation.

– Better cache policies.

39

Stream ISA Extension – Basic Principles• Stream: A decoupled sequence of values/addresses.

• Usage of stream data:– Referred by a pseudo-register.

– Explicitly stepped by a stream_step inst.

Arch. Reg Phys. Reg

Memory 0x408

Memory 0x40c

Memory 0x400

Memory 0x404

Memory 0x410

Stream

Pseudo-Reg

UserStep.

40

Stream ISA Extension – Nested Loop• Indirect Stream.

• Decoupled from control flow.

• Longer streams: Configure stream in outer loop.

• Coalesce stream.

Original C Code Stream Decoupled Pseudo Code Stream Dependence Graph

int i = 0;while (i < M) { int j = 0; while (j < N) { sum += a[i][j]; j++; } i++;}

int i = 0;stream_cfg(s_j, s_a);while (i < M) { while (s_j < N) { sum += s_a; stream_step(s_j); } stream_step(s_j); i++;}stream_end(s_j, s_a);

s_a

s_j

41

Stream ISA Extension – Coalesce Stream• Indirect Stream.

• Decoupled from control flow.

• Longer streams (nested loop).

• Coalesce stream.

Original C Code Stream Decoupled Pseudo Code Stream Dependence Graphint i = 0;while (i < N) { b[i] = a[i].x + a[i].y; i++;}

stream_cfg(s_i, s_a, s_b);while (s_i < N) { s_b = s_a.x + s_a.y; stream_step(s_i);}stream_end(s_i, s_a, s_b);

s_i

s_b s_a

42

Microarchitecture – Core View

Goal: What is the current iteration of a stream?

s step step s step s

0Stream 1 Iter.

Dispatch Iter. Map

Stream User Inst.

steam_step Inst.

Lookup

Increment

43



s step step s step 0

1Stream 1 Iter.

Dispatch Iter. Map

Stream User Inst.

steam_step Inst.

Lookup

Increment

44



s step step s step 0

1Stream 1 Iter.

Dispatch Iter. Map

Stream User Inst.

steam_step Inst.

Lookup

Increment

45



s step step 1 step 0

2Stream 1 Iter.

Dispatch Iter. Map

Stream User Inst.

steam_step Inst.

Lookup

Increment

46



3 step step 1 step 0

3Stream 1 Iter.

Dispatch Iter. Map

Stream User Inst.

steam_step Inst.

Lookup

Increment

47

Microarchitecture – Revert stream_step• Control misspeculated stream_step.

– Decrement the iteration map.

– No need to flush the FIFO and re-fetch data (decoupled) !

Memory 0x408

Pseudo-Reg Memory 0x40c

Memory 0x400

Memory 0x404

Memory 0x410

Stream

UserStep.

48

Stream ISA Extension – Stepping• Explicitly stepping with stream_step instruction.

Memory 0x408

Pseudo-Reg

Memory 0x40c

Memory 0x400

Memory 0x404

Memory 0x410

Stream

UserStep.

49

Microarchitecture – Stream View

Managed by the Stream Engine.

Definition:Address pattern parameters,Initial address, stride, etc.

State:Current iteration.

Operands:Index from cache.For indirect streams.

50

Microarchitecture – Aliasing Streams

• Leverage the LSQ to handle aliasing.– Extend the LSQ with PEB to record prefetching stream elements.

– Insert into the LSQ for the user instruction that semantically triggers the stream access.

– Revert and restart at memory order violation.

51

Stream-Aware Policies - Throttling

Goal: Prefetch just in time.Evenly distribute FIFO among streams leads to untimely prefetching.

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

…

…

Iterations (Stream Elements)

Allocated FIFO Entry Unallocated FIFO Entry

Inst

Stream 1

Stream 2

Not Ready

Ready

52


A “Late” counter per stream.Increased every time a user instruction blocked by the stream.

0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7

…

…



Inst

Stream 1

Stream 2

Not Ready

Ready

1

0

Late Counter

Increment

53


A “Late” counter per stream.Allocate more FIFO entries to the stream with high “Late” count.

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

…

…



Inst

Stream 1

Stream 2

Not Ready

Ready

1

0

Late Counter

Increment

54

Stream-Aware Policies – Throttling

A “Late” counter per stream.After some iterations, stream 1 has enough entries to hide the latency.

4 5 6 7 8 9 10 11

4 5 6 7 8 9 10 11

…

…



Inst

Stream 1

Stream 2

Not Ready

Ready

0

0

Late Counter

55

Stream-Aware Policies – Cache Bypass

0 1 2 3 4 5 6 7 …



Stream 1

Not Fetching

Fetching

MSHR

0 1

0 1 2 3

L1

L2

Memory

Overhead 1: Caching low reuse data.

Overhead 2: Unused L2 MSHRs.

• Stream: Access Pattern -> Precise Memory Footprint.– Precise estimate the memory footprint when configuring streams.

– Bypass.

56


0 1 2 3 4 5 6 7 …



Stream 1

Not Fetching

Fetching

MSHR

0 1 2 3

L1 (Bypassed)

L2

Memory

Benefit 1: Do not evict data in L1.

Benefit 2: Increase memory parallelism.

• Stream: Access Pattern -> Precise Memory Footprint.– Precise estimate the memory footprint when configuring streams.

– Bypass streams with low reuse.

57


Goal: Correctly identify streams with low reuse.A stream table to combine static & dynamic information.

Tags are augmented with the stream id.

A FSM to bypass streams with:

✓ High miss count &&

✓ Low reuse count &&

✓ (Large footprint || No footprint info)

58

Results – Stream-Aware Cache Bypassing

3.40

1

1.1

1.2

1.3

1.4

1.5

SSP-Cache-Aware vs. SSP-Semi-Bind

Non-Binding Prefetch

Semi-Binding Prefetch

Stream-AwareCache

59

Documents

Stream-based Memory Specialization for General Purpose ... · – Energy-efficient prefetching. – Offload Memory from expensive OOO Pipeline – Leverage stream information in cache