
Two Ways to Exploit Multi-Megabyte Caches

AENAO Research Group @ Toronto

Kaveh Aasaraai

Ioana Burcea

Myrto Papadopoulou

Elham Safi

Jason Zebchuk

Andreas Moshovos

{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

EPFL, Jan. 2008

Future Caches: Just Larger?

[Figure: three CPUs, each with an I$ and a D$, an interconnect, and main memory. Label: 10s – 100s of MB.]

1. “Big Picture” Management

2. Store Metadata


Conventional Block Centric Cache

“Small” Blocks Optimizes Bandwidth and Performance

Large L2/L3 caches especially

Fine-Grain View of Memory

L2 Cache

Big Picture Lost


“Big Picture” View

Region: 2^n-sized, aligned area of memory

Patterns and behavior exposed

Spatial locality

Exploit for performance/area/power
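
A minimal sketch of what "2^n-sized, aligned" means in address arithmetic. The 1KB region and 64-byte block sizes are assumptions borrowed from the example configuration used later in the talk, and the function names are illustrative only.

```python
REGION_BITS = 10  # assumption: 1 KB regions (2**10 bytes)
BLOCK_BITS = 6    # assumption: 64-byte blocks

def region_base(addr: int) -> int:
    """Start address of the aligned region that contains addr."""
    return addr & ~((1 << REGION_BITS) - 1)

def block_in_region(addr: int) -> int:
    """Position of addr's block inside its region (0..15 for these sizes)."""
    blocks_per_region = 1 << (REGION_BITS - BLOCK_BITS)
    return (addr >> BLOCK_BITS) & (blocks_per_region - 1)

def same_region(a: int, b: int) -> bool:
    """True when two addresses fall in the same region."""
    return region_base(a) == region_base(b)
```

Because regions are aligned powers of two, finding the region of an address is a mask, not a search.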

Coarse-Grain View of Memory

L2 Cache


Exploiting Coarse-Grain Patterns

Many existing coarse-grain optimizations

Add new structures to track coarse-grain information

CPU

L2 Cache

Stealth Prefetching

Run-time Adaptive Cache Hierarchy Management via Reference Analysis

Destination-Set Prediction

Spatial Memory Streaming

Coarse-Grain Coherence Tracking

RegionScout

Circuit-Switched Coherence

Hard to justify for a commercial design

Coarse-Grain Framework

Embed coarse-grain information in tag array

Support many different optimizations with less area overhead

Adaptable optimization FRAMEWORK


L2 Cache

RegionTracker Solution

Manage blocks, but also track and manage regions

[Figure: four L1 caches and the L2 cache (tag array, data array, RegionTracker). Labels: Block Requests, Data Blocks, Region Probes, Region Responses.]


RegionTracker Summary

Replace conventional tag array:
    4-core CMP with 8MB shared L2 cache
    Within 1% of original performance
    Up to 20% less tag area
    Average 33% less energy consumption

Optimization Framework:
    Stealth Prefetching: same performance, 36% less area
    RegionScout: 2x more snoops avoided, no area overhead


Road Map

Introduction

Goals

Coarse-Grain Cache Designs

RegionTracker: A Tag Array Replacement

RegionTracker: An Optimization Framework

Conclusion


Goals

1. Conventional Tag Array Functionality
    Identify data block location and state
    Leave data array unchanged

2. Optimization Framework Functionality
    Is Region X cached?
    Which blocks of Region X are cached? Where?
    Evict or migrate Region X
    Easy to assign properties to each Region
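
The framework queries listed above can be sketched as a small software model. This is only an illustration of the interface, not the hardware organization: the class and method names are hypothetical, a Python dictionary stands in for the set-associative arrays, and 16 blocks per region is an assumption taken from the 1KB-region example used later.

```python
from dataclasses import dataclass, field

BLOCKS_PER_REGION = 16  # assumption: 1 KB regions of 64-byte blocks

@dataclass
class RegionEntry:
    # way[i] is the data-array way holding block i, or None if block i is not cached
    way: list = field(default_factory=lambda: [None] * BLOCKS_PER_REGION)
    props: dict = field(default_factory=dict)  # per-region properties

class RegionDirectory:
    """Toy region-indexed directory answering the framework queries."""

    def __init__(self):
        self.regions = {}

    def insert_block(self, region, block, way):
        self.regions.setdefault(region, RegionEntry()).way[block] = way

    def is_cached(self, region):
        # "Is Region X cached?"
        return region in self.regions

    def cached_blocks(self, region):
        # "Which blocks of Region X are cached? Where?" -> {block: way}
        entry = self.regions.get(region)
        if entry is None:
            return {}
        return {i: w for i, w in enumerate(entry.way) if w is not None}

    def evict_region(self, region):
        # "Evict Region X": drop the entry, report the blocks that must leave
        entry = self.regions.pop(region, None)
        if entry is None:
            return []
        return [i for i, w in enumerate(entry.way) if w is not None]

    def set_property(self, region, key, value):
        # "Easy to assign properties to each Region"
        self.regions[region].props[key] = value
```

Each query touches one region entry; none requires scanning per-block tags.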



Large Block Size

Increased BW, Decreased hit-rates

[Figure: Region X in the tag array and data array]


Sector Cache

Decreased hit-rates

Region X



Sector Pool Cache

High Associativity (2 - 4 times)

Region X



Decoupled Sector Cache

Region information not exposed

Region replacement requires scanning multiple entries

[Figure: Region X in the tag array, status table, and data array]


Design Requirements

Small block size (64B)

Miss-rate does not increase

Lookup associativity does not increase

No additional access latency (i.e., no scanning, no multiple block evictions)

Does not increase latency, area, or energy

Allows banking and interleaving

Fit in conventional tag array “envelope”



[Figure: four L1 caches, the data array, and the three RegionTracker arrays]

3 SRAM arrays, combined smaller than tag array:
    Region Vector Array (RVA)
    Block Status Table (BST)
    Evicted Region Buffer (ERB)


Basic Structures

Region Vector Array (RVA) entry: Region Tag | block0 ... block15, each with V (1 bit) and way (4 bits)

Block Status Table (BST) entry: status (3 bits) and a 2-bit field

Address: specific RVA set and BST set

RVA entry: multiple, consecutive BST sets

BST entry: one of four RVA sets

Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region


Common Case: Hit

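Address (bits 49 to 0): Region Tag | RVA Index | Region Offset | Block Offset, with field boundaries marked at bits 21, 10, and 6

RVA entry: Region Tag | block0 ... block15, each with V (1 bit) and way (4 bits)

BST entry: status (3 bits) and a 2-bit field

Data array address: Data Array + BST Index | Block Offset, with boundaries marked at bits 19 and 6

To Data Array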

Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
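
A sketch of the address split for this example configuration. It assumes the bit numbers on the slide (21, 10, 6, 0) are the lowest bit of each field, giving Region Tag = bits 49..21, RVA Index = bits 20..10, Region Offset = bits 9..6, and Block Offset = bits 5..0. It also assumes the BST index is bits 18..6, which matches an 8MB, 16-way cache with 64-byte blocks (8192 sets). Function and field names are illustrative.

```python
CACHE_BYTES = 8 * 2**20
WAYS = 16
BLOCK_BYTES = 64
BST_SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)  # 8192 sets -> 13 index bits

def split_address(addr: int) -> dict:
    """Split an address into the fields named on the slide (assumed bit ranges)."""
    return {
        "block_offset": addr & 0x3F,                   # bits 5..0
        "region_offset": (addr >> 6) & 0xF,            # bits 9..6, one of 16 blocks
        "rva_index": (addr >> 10) & 0x7FF,             # bits 20..10
        "region_tag": (addr >> 21) & ((1 << 29) - 1),  # bits 49..21
        "bst_index": (addr >> 6) & (BST_SETS - 1),     # bits 18..6
    }

def rva_set_selector(addr: int) -> int:
    """The two RVA index bits that are not part of the BST index (bits 20..19)."""
    return (addr >> 19) & 0x3
```

Under these assumptions the RVA index has two more high-order bits than the BST index, so a BST set can be reached from four different RVA sets, which is consistent with "BST entry: one of four RVA sets". The 16 blocks of one region map to 16 consecutive BST sets.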


Worst Case (Rare): Region Miss

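Address (bits 49 to 0): Region Tag | RVA Index | Region Offset | Block Offset, with field boundaries marked at bits 21, 10, and 6

RVA entry: Region Tag | block0 ... block15, each with V (1 bit) and way (4 bits)

RVA lookup: No Match!

Data array address: Data Array + BST Index | Block Offset, with boundaries marked at bits 19 and 6

BST entry: status (3 bits) and Ptr (2 bits)

Evicted Region Buffer (ERB): Ptr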

Methodology

Flexus simulator from CMU SimFlex group
    Based on Simics full-system simulator

4-core CMP modeled after Piranha
    Private 32KB, 4-way set-associative L1 caches
    Shared 8MB, 16-way set-associative L2 cache
    64-byte blocks

Miss-rates: Functional simulation of 2 billion instructions per core

Performance and Energy: Timing simulation using SMARTS sampling methodology

Area and Power: Full custom implementation on 130nm commercial technology

9 commercial workloads:
    WEB: SpecWEB on Apache and Zeus
    OLTP: TPC-C on DB2 and Oracle
    DSS: 5 TPC-H queries on DB2

[Figure: four processors (P), each with a D$ and an I$, an interconnect, and the L2]


Miss-Rates vs. Area

Sector Cache: 512KB sectors, SPC and RT: 1KB regions

Trade-offs comparable to conventional cache

[Figure: relative miss-rate (y-axis, 0.99 to 1.05) vs. relative tag array area (x-axis, 0.5 to 1.2) for Sector Pool Cache, RegionTracker, and Conventional Tags. Labeled points: 14-way, 15-way, 48-way, 52-way. Sector Cache is annotated at (0.25, 1.26).]


Performance & Energy

[Figure, two panels. Performance: normalized execution time (0.97 to 1.03) for WEB, OLTP, DSS. Energy: reduction in tag energy (0% to 50%) for WEB, OLTP, DSS.]

12-way set-associative RegionTracker: 20% less area

Error bars: 95% confidence interval

Performance within 1%, with 33% tag energy reduction


Road Map

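Introduction

Goals

Coarse-Grain Cache Designs

RegionTracker: A Tag Array Replacement

RegionTracker: An Optimization Framework

Conclusion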



[Figure: four L1 caches, the RVA, ERB, and BST, and the data array]

Stealth Prefetching: Average 20% performance improvement

Drop-in RegionTracker for 36% less area overhead

RegionScout: In-depth analysis


Snoop Coherence: Common Case

[Figure: three CPUs and main memory. Requests: Read x, Read x+1, Read x+2, ..., Read x+n. Snoop responses: miss, miss.]

Many snoops are to non-shared regions


RegionScout

Eliminate broadcasts for non-shared regions

[Figure: three CPUs and main memory. Labels: Read x, Miss, Region Miss, Global Region Miss.]

Non-Shared Regions

Locally Cached Regions


RegionTracker Implementation

Minimal overhead to support RegionScout optimization

Still uses less area than conventional tag array

Non-Shared Regions

Add 1 bit to each RVA entry

Locally Cached Regions

Already provided by RVA
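
A minimal software sketch of the filtering idea, assuming 1KB regions. The class and method names are hypothetical, and a Python set stands in for the per-entry non-shared bit; it illustrates the decision logic only, not the RegionTracker hardware.

```python
class RegionScoutFilter:
    """Toy model: skip snoop broadcasts for regions known to be non-shared."""

    def __init__(self, region_bits: int = 10):  # assumption: 1 KB regions
        self.region_bits = region_bits
        self.non_shared = set()  # regions for which a global region miss was seen

    def region(self, addr: int) -> int:
        return addr >> self.region_bits

    def needs_broadcast(self, addr: int) -> bool:
        # A request to a region recorded as non-shared can skip the broadcast.
        return self.region(addr) not in self.non_shared

    def record_snoop_result(self, addr: int, global_region_miss: bool) -> None:
        # Every other node reported a region miss: remember the region.
        if global_region_miss:
            self.non_shared.add(self.region(addr))

    def on_remote_request(self, addr: int) -> None:
        # Another node touched the region, so it may now be shared.
        self.non_shared.discard(self.region(addr))
```

After the first broadcast to a region returns a global region miss, later misses to other blocks of that region avoid the broadcast until a remote node requests something in it.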


RegionTracker + RegionScout

[Figure: reduction in snoop broadcasts (0% to 60%) for RS 7KB, RS 12KB, RS 22KB, and RSRT, with BlockScout (4KB) also marked]

4 processors, 512KB L2 Caches

1KB regions

Avoid 41% of snoop broadcasts, no area overhead compared to conventional tag array


Result Summary

Replace Conventional Tag Array:
    20% less tag area
    33% less tag energy
    Within 1% of original performance

Coarse-Grain Optimization Framework:
    36% reduction in area overhead for Stealth Prefetching
    Filter 41% of snoop broadcasts with no area overhead compared to conventional cache

Predictor Virtualization

Ioana Burcea

Joint work with

Stephen Somogyi

Babak Falsafi



Optimization Engines: Predictors

[Figure: chip multiprocessor with several CPUs, each with an L1-D and an L1-I, connected through an interconnect to the L2 and main memory]


Motivating Trends

Dedicating resources to predictors hard to justify:
    Chip multiprocessors: space dedicated to predictors × #processors
    Larger predictor tables: increased performance

Memory hierarchies offer the opportunity
    Increased capacity
    How many apps really use the space?

Use conventional memory hierarchies to store predictor information


PV Architecture contd.

[Figure: the optimization engine's dedicated predictor table is replaced by a PVProxy. The engine sends a request and receives a prediction; the PVProxy (PVCache, MSHR, and a PVStart + index adder) sends requests to the L2, and the PVTable resides in main memory.]

On the backside of the L1
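
A rough software sketch of the proxy idea: a small cache of predictor-table blocks in front of a table that lives in the ordinary memory hierarchy. The names `PVProxy`, `pv_start`, and `lookup` are illustrative, a Python dict stands in for the L2 and main memory, and the fully associative LRU policy is a simplifying assumption, not the organization described in the talk.

```python
from collections import OrderedDict

class PVProxy:
    """Toy PVProxy: a small PVCache backed by a predictor table in memory."""

    def __init__(self, memory: dict, pv_start: int,
                 pvcache_blocks: int = 8, block_bytes: int = 64):
        self.memory = memory            # stands in for L2 + main memory
        self.pv_start = pv_start        # base address of the virtualized table
        self.capacity = pvcache_blocks
        self.block_bytes = block_bytes
        self.pvcache = OrderedDict()    # address -> block contents, in LRU order
        self.hits = 0
        self.misses = 0

    def lookup(self, index: int) -> dict:
        """Return the table block for a set index, fetching it on a PVCache miss."""
        addr = self.pv_start + index * self.block_bytes
        if addr in self.pvcache:
            self.hits += 1
            self.pvcache.move_to_end(addr)
        else:
            self.misses += 1
            if len(self.pvcache) >= self.capacity:
                # Write the least recently used block back to memory.
                victim_addr, victim = self.pvcache.popitem(last=False)
                self.memory[victim_addr] = victim
            self.pvcache[addr] = self.memory.get(addr, {})
        return self.pvcache[addr]
```

The optimization engine only ever calls `lookup`; whether the block came from the PVCache or from memory is hidden behind the proxy.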


To Virtualize Or Not to Virtualize?

1. Re-Use

2. Predictor Info Prefetching

[Figure: CPU with I$ and D$, interconnect, L2/L3, and main memory. Labels: Common Case, Infrequent.]


To Virtualize or Not?

Challenge: hit in the PVCache most of the time

Will not work for all predictors out of the box

Reuse is necessary
    Intrinsic: easy to virtualize
    Non-intrinsic: must be engineered
        More so if the predictor needs to be fast to start with


Will There Be Reuse?

Intrinsic: multiple predictions per entry
    We’ll see an example

Can be engineered
    Group temporally correlated entries together in a cache block

[Figure: CPU with I$ and D$, interconnect, L2/L3, and main memory]


Spatial Memory Streaming

Footprint: Blocks accessed per memory region

Predict next time the footprint will be the same

Handle: PC + offset within region
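
A simplified sketch of footprint recording and prediction. It assumes 64-byte blocks and 32 blocks per region (the spatial-group size quoted later in the talk); the class name and methods are hypothetical, and the end of a generation is signalled explicitly by the caller rather than detected from evictions.

```python
BLOCK_BITS = 6          # assumption: 64-byte blocks
BLOCKS_PER_REGION = 32  # assumption: 32 blocks per region

class FootprintRecorder:
    """Toy footprint predictor keyed by (PC, offset within region)."""

    def __init__(self):
        self.active = {}    # region -> [handle, footprint bit vector]
        self.patterns = {}  # handle -> footprint recorded last time

    def access(self, pc: int, addr: int) -> list:
        """Record an access; on a trigger access return block numbers to prefetch."""
        block = addr >> BLOCK_BITS
        region, offset = divmod(block, BLOCKS_PER_REGION)
        if region not in self.active:
            handle = (pc, offset)
            predicted = self.patterns.get(handle, 0)
            self.active[region] = [handle, 1 << offset]
            return [region * BLOCKS_PER_REGION + i
                    for i in range(BLOCKS_PER_REGION)
                    if (predicted >> i) & 1 and i != offset]
        self.active[region][1] |= 1 << offset
        return []

    def end_generation(self, region: int) -> None:
        """Store the finished footprint under the handle that started it."""
        handle, footprint = self.active.pop(region)
        self.patterns[handle] = footprint
```

The first access to a region acts as the trigger: its PC and offset form the handle, and the footprint recorded under that handle last time is replayed as prefetches.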


Spatial Generations


Virtualizing SMS

[Figure: Detector and Predictor. Labels: trigger access, patterns, prefetches, Virtualize.]


Virtualizing SMS

Virtual Table: 1K sets, 11 ways

PVCache: 8 sets, 11 ways

Cache block layout: tag | pattern | tag | pattern | ... | unused (bit positions marked: 0, 11, 43, 54, 85)


Packing Entries in One Cache Block

Index: PC + offset within spatial group (21 bit index)
    PC → 16 bits
    32 blocks in a spatial group → 5 bit offset → 32 bit spatial pattern

Pattern table: 1K sets
    10 bits to index the table → 11 bit tag

Cache block: 64 bytes
    11 entries per cache block → Pattern table 1K sets – 11-way set associative

Cache block layout: tag | pattern | tag | pattern | ... | unused (bit positions marked: 0, 11, 43, 54, 85)
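
The packing arithmetic can be checked with a short sketch: 11 entries of (11-bit tag + 32-bit pattern) occupy 473 of the 512 bits in a 64-byte block, leaving 39 bits unused. The function names are illustrative, and the little-endian bit numbering (entry 0 in the lowest bits) is an assumption chosen to match the 0 / 11 / 43 / 54 / 85 positions on the slide.

```python
TAG_BITS = 11
PATTERN_BITS = 32
ENTRY_BITS = TAG_BITS + PATTERN_BITS  # 43 bits per entry
ENTRIES_PER_BLOCK = 11
BLOCK_BYTES = 64

def pack_block(entries: list) -> bytes:
    """Pack up to 11 (tag, pattern) pairs into one 64-byte block."""
    if len(entries) > ENTRIES_PER_BLOCK:
        raise ValueError("too many entries for one cache block")
    word = 0
    for i, (tag, pattern) in enumerate(entries):
        if not (0 <= tag < 1 << TAG_BITS and 0 <= pattern < 1 << PATTERN_BITS):
            raise ValueError("tag or pattern out of range")
        # Tag in the low 11 bits of the entry, pattern in the next 32 bits.
        word |= (tag | (pattern << TAG_BITS)) << (i * ENTRY_BITS)
    return word.to_bytes(BLOCK_BYTES, "little")

def unpack_block(block: bytes) -> list:
    """Recover the 11 (tag, pattern) pairs from a packed block."""
    word = int.from_bytes(block, "little")
    entries = []
    for i in range(ENTRIES_PER_BLOCK):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries
```

With this layout one cache block holds a full 11-way set of the pattern table, so a single block fetch brings in every way that a lookup needs to compare.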


Memory Address Calculation

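[Figure: the PC (16 bits) and block offset (5 bits) form the index; 10 bits of it, followed by six zero bits (000000), are added to the PV Start Address to produce the Memory Address]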
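
A sketch of the address calculation, assuming the 21-bit index is the 16 PC bits concatenated with the 5-bit block offset, and that the low 10 bits of that index select the set while the remaining 11 bits form the tag. The slide does not say which 10 bits are used, so that choice and the function name are assumptions.

```python
def pv_entry_address(pv_start: int, pc: int, block_offset: int) -> tuple:
    """Return (memory address of the set's cache block, 11-bit tag)."""
    index21 = ((pc & 0xFFFF) << 5) | (block_offset & 0x1F)  # 16 + 5 bits
    set_index = index21 & 0x3FF   # assumption: low 10 bits pick one of 1K sets
    tag = index21 >> 10           # remaining 11 bits
    # Appending six zero bits scales the set index by the 64-byte block size.
    return pv_start + (set_index << 6), tag
```

For example, `pv_entry_address(0x100000, 0x1234, 7)` gives address `0x10A1C0` and tag `0x91` under these assumptions.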


Simulation Infrastructure

SimFlex: CMU Impetus
    Full-system simulator based on Simics

Base processor configuration
    8-wide OoO
    256-entry ROB / 64-entry LSQ
    L1D/L1I 64KB 4-way set-associative
    UL2 8MB 16-way set-associative

Commercial workloads
    TPC-C: DB2 and Oracle
    TPC-H: Query 1, Query 2, Query 16, Query 17
    Web: Apache and Zeus


SMS – Performance Potential

[Figure: percentage of L1 read misses (0 to 140%) split into covered, uncovered, and overpredictions, for Apache, Oracle, and Qry 17. Predictor configurations on the x-axis: Infinite, 1K-16a, 1K-11a, 512-11a, 256-11a, 128-11a, 64-11a, 32-11a, 16-11a, 8-11a.]


Virtualized Spatial Memory Streaming

[Figure: percentage speedup (-10 to 80) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17. Series: SMS - 1K sets, SMS - 8 sets, SMS - PVCache 8 sets.]

Original prefetcher cost: 60KB

Virtualized prefetcher cost: <1KB

Nearly identical performance


Impact of Virtualization on L2 Misses

[Figure: percentage increase in L2 misses (0 to 2.5) for Apache, Oracle, Qry 17. Series: PV-8, PV-16, PV-32.]


Impact of Virtualization on L2 Requests

[Figure: percentage increase in L2 requests (0 to 50) for Apache, Oracle, Qry 17. Series: PV-8, PV-16, PV-32.]

Coarse-Grain Tracking

Jason Zebchuk
