Not Quite My Tempo 1 Matching Prefetches to Memory Access Times 1. This work was inspired by Chazelle, Simmons, and Teller’s foundational work in this area: Whiplash. 1 Mark Sutherland Ajaykumar Kannan Natalie Enright Jerger {suther68,kannanaj,enright}@ece.utoronto.ca

Not Quite My Tempo1comparch-conf.gatech.edu/dpc2/resource/dpc2_sutherland... · 2015. 6. 11. · Not Quite My Tempo 1 Matching Prefetches to Memory Access Times 1. This work was inspired



Not Quite My Tempo¹

Matching Prefetches to Memory Access Times

Mark Sutherland, Ajaykumar Kannan, Natalie Enright Jerger

{suther68,kannanaj,enright}@ece.utoronto.ca

¹ This work was inspired by Chazelle, Simmons, and Teller’s foundational work in this area: Whiplash.


Goal: Add temporal information to an SMS prefetcher.


Motivation

[Figure: Useless Prefetches in SMS. Y-axis: fraction of prefetches, 0 to 1. X-axis: 400.PERLBENCH, 401.BZIP2, 403.GCC, 410.BWAVES, 416.GAMESS, 429.MCF, 433.MILC, 434.ZEUSMP, 435.GROMACS, 436.CACTUSADM, 437.LESLIE3D, 444.NAMD, 450.SOPLEX, 453.POVRAY, 454.CALCULIX, 456.HMMER, 458.SJENG, 462.LIBQUANTUM, 464.H264REF, 470.LBM, 471.OMNETPP, 473.ASTAR.]


Our Contributions

1. Propose a framework for timely prefetching in SMS, which utilizes temporal information to filter prefetches in the Pattern History Table.

2. Demonstrate practical improvements over SMS with a 32kB storage budget.


SMS Prefetcher Background

Source: S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial Memory Streaming. SIGARCH Comput. Archit. News, 2006.


SMS Prefetcher Background: Representation of a Memory Region

Page Bits       Accessed Cache Blocks
1100110100 …    1 1 1 1 0 0 0 1 1 0 1 0 1 0 1
1110100101 …    1 0 1 0 0 1 0 1 0 0 1 0 0 1 1
0100000010 …    0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0

• Memory regions (most often the size of a physical page) are represented as bit vectors, where a set bit indicates that the program accessed the corresponding cache block.
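The bit-vector representation can be sketched in a few lines. This is only an illustration: the 64-byte block size, the 4 kB region size, and the function names are assumptions chosen for the example, not values taken from the slides.

```python
BLOCK_SIZE = 64                              # bytes per cache block (assumed)
REGION_SIZE = 4096                           # bytes per region, one physical page (assumed)
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE


def split_address(addr):
    """Return (region_id, block_offset) for a byte address."""
    return addr // REGION_SIZE, (addr % REGION_SIZE) // BLOCK_SIZE


def record_access(pattern, addr):
    """Set the bit for the block that `addr` falls in and return the new pattern."""
    _, offset = split_address(addr)
    return pattern | (1 << offset)


def accessed_blocks(pattern):
    """List the block offsets whose bits are set."""
    return [i for i in range(BLOCKS_PER_REGION) if (pattern >> i) & 1]


pattern = 0
for addr in (0x1000, 0x1040, 0x1FC0):        # three accesses to the same 4 kB page
    pattern = record_access(pattern, addr)
```

Here `accessed_blocks(pattern)` lists blocks 0, 1, and 63 of the page. The order of the three accesses is not recoverable from `pattern`.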


SMS Prefetcher Background

[Diagram: Core & L1, L2 Cache, and Main Memory; the SMS prefetcher holds an Active Generation Table (AGT) and a Pattern History Table (PHT).]

Step 1: Observe Misses

PC1: ld A+2

Active Generation Table
PC + Offset    Cache Blocks
PC1+2          0 0 1 0 0 0 0 0


SMS Prefetcher Background

Step 1: Observe Misses

PC1: ld A+5

Active Generation Table
PC + Offset    Cache Blocks
PC1+2          0 0 1 0 0 1 0 0


SMS Prefetcher Background

Step 1: Observe Misses

PC2: ld B+4

Active Generation Table
PC + Offset    Cache Blocks
PC1+2          0 0 1 0 0 1 0 0
PC2+4          0 0 0 0 1 0 0 0


SMS Prefetcher Background

Step 1: Observe Misses

PC1: ld A

Active Generation Table
PC + Offset    Cache Blocks
PC1+2          1 0 1 0 0 1 0 0
PC2+4          0 0 0 0 1 0 0 0


SMS Prefetcher Background

Step 2: Eviction Transfers Entry to PHT

L2: evict A

Active Generation Table
PC + Offset    Cache Blocks
PC2+4          0 0 0 0 1 0 0 0

Pattern History Table
PC + Offset    Cache Blocks
PC1+2          1 0 1 0 0 1 0 0


SMS Prefetcher Background

Step 3: Issue Prefetches

PC1: ld C+2

Active Generation Table
PC + Offset    Cache Blocks
PC2+4          0 0 0 0 1 0 0 0

Pattern History Table
PC + Offset    Cache Blocks
PC1+2          1 0 1 0 0 1 0 0

Prefetches issued: {C, C+5}


SMS Prefetcher Background

[Diagram: Core & L1, L2 Cache, and Main Memory; SMS with its Active Generation Table and Pattern History Table.]

1. Observe Misses
2. Eviction Transfers Entry to PHT
3. Issue Prefetches {p. block}
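The three steps can be replayed in a small model. The sketch below is a simplification under stated assumptions: regions are named by strings, a region holds 8 blocks as drawn on the slides, a generation ends only when `evict` is called explicitly, and the class and method names are invented for this example rather than taken from the SMS paper.

```python
BLOCKS_PER_REGION = 8    # matches the 8-bit vectors drawn on the slides


class SMS:
    """Toy Spatial Memory Streaming model: AGT records, PHT predicts."""

    def __init__(self):
        self.agt = {}    # region -> (trigger key, bit pattern being recorded)
        self.pht = {}    # trigger key (pc, offset) -> learned bit pattern

    def access(self, pc, region, offset):
        """Step 1: observe one miss. Returns block offsets to prefetch (step 3)."""
        if region in self.agt:
            key, pattern = self.agt[region]
            self.agt[region] = (key, pattern | (1 << offset))
            return []
        # First miss to this region: it is the trigger access.
        key = (pc, offset)
        self.agt[region] = (key, 1 << offset)
        learned = self.pht.get(key, 0)
        return [i for i in range(BLOCKS_PER_REGION)
                if (learned >> i) & 1 and i != offset]

    def evict(self, region):
        """Step 2: an eviction ends the generation and moves its entry to the PHT."""
        if region in self.agt:
            key, pattern = self.agt.pop(region)
            self.pht[key] = pattern


sms = SMS()
sms.access("PC1", "A", 2)                  # PC1: ld A+2 (trigger for region A)
sms.access("PC1", "A", 5)                  # PC1: ld A+5
sms.access("PC2", "B", 4)                  # PC2: ld B+4 (trigger for region B)
sms.access("PC1", "A", 0)                  # PC1: ld A
sms.evict("A")                             # L2: evict A
prefetches = sms.access("PC1", "C", 2)     # PC1: ld C+2
```

With the access sequence from the slides, `prefetches` holds offsets 0 and 5, which corresponds to {C, C+5}.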


Adding Temporal Information


• Problem: In the bit vectors stored by SMS, we do not have any information about the order of the accesses.

• Goal: A mechanism to distinguish various accesses within a spatial region, preferably one that allows us to prefetch blocks in a more timely fashion.


Adding Temporal Information

• Observe that many of our applications see the same PC generating memory requests with different time deltas between each request.

[Figure: two timelines (t) of accesses labeled H H M H H M H, annotated with time deltas 6, 31, 82 and 3, 25, 69.]


Adding Temporal Information

• Add a global counter to the L2 that ticks on every cache miss.
• Define access tempo: Delta between two cache accesses.

[Diagram: when a cache miss is observed, the L2 cache counter advances from 000100 to 000101.]
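A minimal sketch of the counter and of an access-tempo measurement follows. It assumes a 6-bit counter (the width drawn on the slide), measures the tempo per PC, and uses modular subtraction to survive wrap-around; the class and method names are hypothetical.

```python
class TempoCounter:
    """Global miss counter plus per-PC access-tempo measurement (a sketch)."""

    def __init__(self, bits=6):
        self.mask = (1 << bits) - 1    # 6 bits as drawn on the slide; an assumption
        self.value = 0
        self.last_seen = {}            # pc -> counter value at that PC's previous access

    def tick(self):
        """Advance the counter; called once per L2 miss."""
        self.value = (self.value + 1) & self.mask

    def access_tempo(self, pc):
        """Return ticks since `pc` last accessed the cache, or None the first time."""
        previous = self.last_seen.get(pc)
        self.last_seen[pc] = self.value
        if previous is None:
            return None
        return (self.value - previous) & self.mask   # modular, so wrap-around is handled


counter = TempoCounter()
counter.access_tempo("PC1")            # first access by PC1, nothing to compare against
for _ in range(3):
    counter.tick()                     # three misses elsewhere
tempo = counter.access_tempo("PC1")    # three ticks elapsed
```

A narrow counter keeps storage small, at the cost of aliasing when more than 2^bits misses separate two accesses.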


Adding Temporal Information

• Decompose all PHT entries into multiple access tempos.
• Store each disjoint tempo in its own “slice” of SMS.

PHT: 0001111001011110100101101
T1:  0001100001000000000100101
T2:  000011000011110100001000
…
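One way to picture the decomposition: bin each access by its tempo and set its bit only in the slice for that bin. The bin edges, the number of slices, and the function names below are assumptions made for illustration; the slides do not specify them.

```python
TEMPO_BOUNDS = (4, 16, 64)    # hypothetical bin edges, in counter ticks


def tempo_bin(tempo):
    """Map a tempo to a slice index: 0 is the fastest bin, len(TEMPO_BOUNDS) the slowest."""
    for index, bound in enumerate(TEMPO_BOUNDS):
        if tempo < bound:
            return index
    return len(TEMPO_BOUNDS)


def slice_pattern(accesses):
    """Split (block_offset, tempo) pairs into one bit vector per tempo bin."""
    slices = [0] * (len(TEMPO_BOUNDS) + 1)
    for offset, tempo in accesses:
        slices[tempo_bin(tempo)] |= 1 << offset
    return slices


accesses = [(3, 2), (4, 30), (9, 3), (12, 100), (13, 20)]
slices = slice_pattern(accesses)

full = 0
for s in slices:
    full |= s                 # OR of all slices recovers the unsliced pattern
```

As long as each block is recorded with a single tempo, the slices are disjoint and their OR equals the original SMS bit vector.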


Training Based on Tempo

• Store the access tempos in on-chip Localized Tempo Buffers (LTBs).
• During training, only set the bits of the vector corresponding to the currently measured tempo.

[Diagram: a table labeled PC Hash, LTB, and Age (extracted values: 4 2 5 / 12 8 9 / 1 3 1) feeding the AGT slices T0: 00000010010, T2: 0100110010, …]
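The training rule can be sketched as follows. Everything structural here is an assumption: the LTB is modeled as a table indexed by a PC hash (here simply `pc % entries`) that keeps a few recent tempos, and the slice count and bin edges are hypothetical.

```python
from collections import deque

LTB_DEPTH = 3                  # tempos kept per PC hash (assumed)
TEMPO_BOUNDS = (4, 16, 64)     # hypothetical bin edges, giving 4 slices


def tempo_bin(tempo):
    """Slice index for a tempo: the number of bin edges it has reached."""
    return sum(tempo >= bound for bound in TEMPO_BOUNDS)


class LocalizedTempoBuffer:
    """Per-PC-hash history of recently measured access tempos (a sketch)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}        # pc hash -> deque of recent tempos

    def record(self, pc, tempo):
        history = self.table.setdefault(pc % self.entries, deque(maxlen=LTB_DEPTH))
        history.append(tempo)

    def current_tempo(self, pc):
        history = self.table.get(pc % self.entries)
        return history[-1] if history else None


def train(agt_slices, ltb, pc, offset, tempo):
    """Record the measured tempo, then set the block's bit only in that tempo's slice."""
    ltb.record(pc, tempo)
    agt_slices[tempo_bin(tempo)] |= 1 << offset


ltb = LocalizedTempoBuffer()
agt_slices = [0, 0, 0, 0]
train(agt_slices, ltb, 0x400123, offset=1, tempo=2)
train(agt_slices, ltb, 0x400123, offset=6, tempo=40)
train(agt_slices, ltb, 0x400123, offset=7, tempo=3)
```

After these three accesses, blocks 1 and 7 sit in the fastest slice and block 6 in a slower one, so a later prefetch can fetch them separately.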


Prefetching Based on Tempo

• Simply knowing our current access tempo is not enough to ensure prefetch timeliness.
• Define memory tempo: Delta between when a memory request is issued, and when the block is filled into the cache.

[Diagram, step 1: the L2 cache (counter: 000100) sends request Creq from one of its MSHRs (MSHR 1 … MSHR N) to main memory and stores a snapshot of the counter.]


Prefetching Based on Tempo

[Diagram, step 2: when main memory returns the block for Creq to its MSHR (MSHR 1 … MSHR N), the L2 cache (counter: 0010001) calculates the memory tempo from the stored snapshot.]
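The snapshot-and-subtract measurement could look like the sketch below. The 7-bit counter width, the dictionary keyed by MSHR index, and the names are assumptions for illustration only.

```python
class MemoryTempoTracker:
    """Measures memory tempo from counter snapshots kept per MSHR (a sketch)."""

    def __init__(self, counter_bits=7):
        self.mask = (1 << counter_bits) - 1    # counter width is an assumption
        self.snapshots = {}                    # MSHR index -> counter value at issue

    def on_issue(self, mshr, counter):
        """Step 1: store a snapshot of the counter when the request leaves the L2."""
        self.snapshots[mshr] = counter & self.mask

    def on_fill(self, mshr, counter):
        """Step 2: on fill, return the ticks elapsed since the snapshot."""
        issued = self.snapshots.pop(mshr)
        return (counter - issued) & self.mask


tracker = MemoryTempoTracker()
tracker.on_issue(mshr=1, counter=0b000100)                   # counter value at issue
memory_tempo = tracker.on_fill(mshr=1, counter=0b0010001)    # counter value at fill
```

With the counter values drawn on the two slides (4 at issue, 17 at fill), the measured memory tempo is 13 ticks.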


Rushing or Dragging?

• Can compare the memory’s tempo against the current PC’s access tempo at prefetch time.

PC > Memory (Rushing)
• Generating requests faster than memory is returning them.
• Prefetch only slower tempos.

Memory > PC (Dragging)
• Memory is fast enough to keep up with our requests.
• Prefetch faster tempos.
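The comparison reduces to a two-way classification. The sketch below reads "PC > Memory" as "the PC issues requests at a higher rate than memory returns them", so a smaller access tempo than memory tempo means rushing; that reading, the tie-breaking toward dragging, and the function names are assumptions.

```python
def decode(access_tempo, memory_tempo):
    """Classify the current PC as 'rushing' or 'dragging' relative to memory.

    Tempos are deltas in counter ticks, so a smaller tempo means a faster rate.
    Equal tempos are treated as dragging (an arbitrary choice for this sketch).
    """
    if access_tempo < memory_tempo:
        return "rushing"     # requests arrive faster than memory returns them
    return "dragging"        # memory keeps up with the requests


def tempos_to_prefetch(access_tempo, memory_tempo):
    """Return which group of tempo slices the rule on the slide would select."""
    if decode(access_tempo, memory_tempo) == "rushing":
        return "slower"
    return "faster"


state = decode(access_tempo=6, memory_tempo=12)
choice = tempos_to_prefetch(access_tempo=6, memory_tempo=12)
```

An access tempo of 6 against a memory tempo of 12 classifies as rushing, so only the slower tempos would be prefetched.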


Walkthrough Example

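[Diagram: the PC of “Load [B+2]” indexes the LTB (entries 8, 4, 6, each with an Age field), which yields a local access tempo of 6. The GTB (also with an Age field) holds 12, 9, 14, 14, 9, 10, 12. Both feed the Tempo Decoder, which selects among the AGT slices (T2: 00000010010, T3: 0100110010, …) and outputs Prefetch “Slow & V. Slow”. Circled numbers 1 to 5 mark the walkthrough steps.]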

1. PC Observes Demand Request to Address [B+2]


2. Lookup Current Local Access Tempo (6)


3. Average All Elements of Global Tempo Buffer (12)


4. Decode “Rushing” or “Dragging”.


5. Prefetch Matching Tempos
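Steps 2 through 5 can be strung together in one function. This is a sketch under several assumptions: the GTB average uses integer division, slices are ordered fastest to slowest, "slower" and "faster" are taken to mean the slower and faster half of the slices, and the example slice contents are invented.

```python
def average_tempo(gtb):
    """Step 3: integer mean of the Global Tempo Buffer (rounding is an assumption)."""
    return sum(gtb) // len(gtb)


def prefetch_offsets(local_tempo, gtb, slices):
    """Steps 4 and 5: decode rushing or dragging, then merge the matching slices.

    `slices` holds one bit vector per tempo bin, ordered fastest to slowest.
    Returns the block offsets to prefetch.
    """
    memory_tempo = average_tempo(gtb)
    half = len(slices) // 2
    if local_tempo < memory_tempo:     # rushing: prefetch only the slower tempos
        chosen = slices[half:]
    else:                              # dragging: prefetch the faster tempos
        chosen = slices[:half]
    combined = 0
    for s in chosen:
        combined |= s
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]


slices = [0b0001, 0b0010, 0b0100, 0b1000]    # hypothetical slice contents
offsets = prefetch_offsets(local_tempo=6, gtb=[10, 12, 14], slices=slices)
```

With a local tempo of 6 and a GTB average of 12, the PC is rushing, so only the two slowest slices contribute and `offsets` holds blocks 2 and 3.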


Evaluation

Core (OoO)                              Memory
6-wide, 256 entry instruction window    16 kB L1, 128 kB L2, fully inclusive L3
No fetch hazards                        32 entry L2 request queue
Issue 2 loads & 1 store / cycle         Open Row FR-FCFS memory scheduling

t_L2 = 10 cycles

Config 1: 1 MB L3, 12.8 GB/s memory
Config 2: 256 kB L3, 12.8 GB/s memory
Config 3: 1 MB L3, 3.2 GB/s memory


Normal Configuration

Absolute IPC Results


Absolute IPC Results

[Figure: IPC for mcf under Baseline, SMS, and Tempo; y-axis from 0 to 0.18; annotated “20%”.]


Increased Low B/W Performance


• Decreasing memory B/W increases Tempo’s gains.


Reduction in Useless Prefetches

[Figure: chart of the reduction in useless prefetches, with an arrow labeled “Better”.]


Conclusion


1. Presented a “sliced” version of SMS that filters prefetching decisions based on repeated PC access times.

2. Achieved IPC improvements of 1.45% and 2.57% on the Normal and Low Bandwidth configurations, respectively.

3. Reduced the number of useless prefetches by 17.6%.


Questions?