Snoop Filtering and Coarse-Grain Memory Tracking Andreas Moshovos Univ. of Toronto/ECE Short Course at the University of Zaragoza, July 2009 Some slides by J. Zebchuk or the original paper authors


Page 1: Snoop Filtering  and  Coarse-Grain Memory Tracking

Snoop Filtering and Coarse-Grain Memory Tracking

Andreas Moshovos
Univ. of Toronto/ECE

Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors

Page 2: Snoop Filtering  and  Coarse-Grain Memory Tracking

JETTY: Snoop-Filtering for Reduced Power in SMP Servers

Andreas Moshovos
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern

Int’l Symposium on High-Performance Computer Architecture (HPCA), 2001

Page 3: Snoop Filtering  and  Coarse-Grain Memory Tracking

Power is Becoming Important
• Architecture is a science of tradeoffs
• Thus far: Performance vs. Cost vs. Complexity
• Today: Performance vs. Cost vs. Complexity vs. Power
• Where?
  – Mobile Devices
  – Desktops/Servers (Our Focus)

Page 4: Snoop Filtering  and  Coarse-Grain Memory Tracking

Power-Aware Servers
• Revisit the design of SMP servers
  – 2 or more CPUs per machine
  – Snoop coherence-based
• Why?
  – File, web, databases, your typical desktop
  – Cost effective too
• This work, a first step: Power-Aware Snoopy-Coherence

Page 5: Snoop Filtering  and  Coarse-Grain Memory Tracking

Power-Aware Snoop-Coherence
• Conventional
  – All L2 caches snoop all memory traffic
  – Power expended by all on any memory access
• Jetty-Enhanced
  – Tiny structure on L2-backside
  – Filters most “would-be-misses”
  – Less power expended on most snoop misses
  – No changes to protocol necessary
  – No performance loss

Page 6: Snoop Filtering  and  Coarse-Grain Memory Tracking

Roadmap
• Why is Power a Concern for Servers?
• Snoopy-Coherence Basics
• An Opportunity for Reducing Power
• JETTY
• Results
• Summary

Page 7: Snoop Filtering  and  Coarse-Grain Memory Tracking

Why is Power Important?

Power Could Ultimately Limit Performance
• Power Demands have been increasing
• Deliver Energy to and on chip
• Dissipate Heat
• Limit:
  – Amount of resources & frequency
  – Feasibility
• Cooling a solution: Cost & Integration?

Reducing Power Demands is much more convenient

Page 8: Snoop Filtering  and  Coarse-Grain Memory Tracking

What can be done?
• Redesign Circuits
• Clock Gating and Frequency Scaling
  – A lot has been done thus far
  – Still active
• Rethink Architectural Decisions
  – Orthogonal to others

Reduce Power Under Performance Constraints

Page 9: Snoop Filtering  and  Coarse-Grain Memory Tracking

The “Silver Bullet” Solution
• Good if there were one
• However, until one is found...
• Look at all structures
• Rethink Design
• Propose Power-Optimized versions
• This is what we’re doing for performance

Page 10: Snoop Filtering  and  Coarse-Grain Memory Tracking

Snoopy Cache Coherence

All L2 tags see all bus accesses
Intervene when necessary

[Figure: CPU cores, each with L1 and L2 caches, connected to Main Memory; a snoop hits in a remote L2]

Page 11: Snoop Filtering  and  Coarse-Grain Memory Tracking

How About Power?

All L2 tags see all bus accesses
• Perf. & Complexity: Have L2 tags, why not use them
• Power: All L2 tags consume power on all accesses

[Figure: three CPU cores with L1 and L2 caches connected to Main Memory; the snoops miss in both remote L2s]

Page 12: Snoop Filtering  and  Coarse-Grain Memory Tracking

JETTY: A Would-be Snoop-Miss Filter

Imprecise: may fail to filter a would-be miss, but never filters snoop-hits

• Would-be Snoop-Miss: addr → JETTY → “Not here!” (the snoop is filtered before it reaches CPU n)
• Would-be Snoop-Hit: addr → JETTY → “Don’t Know” (the snoop is passed on to CPU n)

Detect most misses using fewer resources

Page 13: Snoop Filtering  and  Coarse-Grain Memory Tracking

Potential for Savings Exists

• Most Snoops miss
  – 91% AVG
• Many L2 accesses are due to Snoop Misses
  – 55% AVG
• Sizeable Potential Power Savings:
  – 20% - 50% of total L2 power

Page 14: Snoop Filtering  and  Coarse-Grain Memory Tracking

Exclude-Jetty

• Subset of what is not cached
• How? Cache recent snoop-misses locally

[Figure: the address space split into cached and not cached; Exclude-JETTY covers part of the not-cached set]

Page 15: Snoop Filtering  and  Coarse-Grain Memory Tracking

Exclude-Jetty

• Subset of what you don’t have

Works well for producer-consumer
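
The idea of caching recent snoop-misses can be sketched as follows. This is a minimal illustration only: the class name `ExcludeJetty`, the direct-mapped organization, the table size, and the `snoop` helper are assumptions made for the example, not the organization evaluated in the paper.

```python
class ExcludeJetty:
    """Hypothetical direct-mapped table of block addresses known NOT to be cached."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, block):
        return block % self.entries

    def filters(self, block):
        # True means "not here!": the L2 tag lookup can be skipped.
        return self.table[self._index(block)] == block

    def record_snoop_miss(self, block):
        # Remember a block that a snoop just failed to find in the L2.
        self.table[self._index(block)] = block

    def on_local_fill(self, block):
        # The block is now cached locally, so the entry must be dropped,
        # otherwise a later snoop-hit would be filtered (incorrect).
        i = self._index(block)
        if self.table[i] == block:
            self.table[i] = None


def snoop(jetty, l2_blocks, block):
    """Handle one incoming snoop; l2_blocks is the set of locally cached blocks."""
    if jetty.filters(block):
        return "filtered"
    if block in l2_blocks:
        return "hit"
    jetty.record_snoop_miss(block)
    return "miss"
```

A repeated snoop to the same uncached block is filtered the second time, which is the producer-consumer pattern mentioned above: one node keeps touching data that the other nodes do not hold.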

Page 16: Snoop Filtering  and  Coarse-Grain Memory Tracking

Include-Jetty

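• Superset of what is cached
• How? Well...

[Figure: the address space split into cached and not cached; Include-JETTY covers the cached set plus part of the not-cached set]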

Page 17: Snoop Filtering  and  Coarse-Grain Memory Tracking

Include-Jetty

[Figure: the address is hashed by f( ), g( ), and h( ) to index bit vector 0, bit vector 1, and bit vector 2]

• Not Cached: any zero bit
• May be Cached: all bits set

Later I was told this is a Bloom filter…

Page 18: Snoop Filtering  and  Coarse-Grain Memory Tracking

Include-Jetty

• Superset of what you have

This is a counting Bloom filter:
L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, In Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.

Partial overlapping indexes worked better
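
A counting Bloom filter of this kind can be sketched as below. The class name `IncludeJetty`, the number of vectors, the index width, and the way the partially overlapping indexes are cut out of the block address (`skip_bits` smaller than `index_bits`) are all assumptions chosen for illustration, not the configurations from the paper.

```python
class IncludeJetty:
    """Hypothetical counting Bloom filter tracking a superset of cached blocks."""

    def __init__(self, index_bits=5, num_vectors=3, skip_bits=3):
        self.index_bits = index_bits
        self.num_vectors = num_vectors
        self.skip_bits = skip_bits  # skip_bits < index_bits gives overlapping indexes
        self.counters = [[0] * (1 << index_bits) for _ in range(num_vectors)]

    def _indexes(self, block):
        mask = (1 << self.index_bits) - 1
        return [(block >> (v * self.skip_bits)) & mask for v in range(self.num_vectors)]

    def on_fill(self, block):
        # A block enters the L2: increment one counter per vector.
        for v, i in enumerate(self._indexes(block)):
            self.counters[v][i] += 1

    def on_evict(self, block):
        # A block leaves the L2: decrement the same counters.
        for v, i in enumerate(self._indexes(block)):
            self.counters[v][i] -= 1

    def may_be_cached(self, block):
        # Any zero counter proves the block is not cached; all non-zero means "maybe".
        return all(self.counters[v][i] > 0 for v, i in enumerate(self._indexes(block)))
```

Because counters are only decremented for blocks that were previously inserted, a cached block can never be reported as absent; the filter can only err toward "may be cached".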

Page 19: Snoop Filtering  and  Coarse-Grain Memory Tracking

Hybrid-Jetty
• In some cases Exclude-J works well
• In some others Include-J is better
• Combine
  – Access in parallel on snoop
  – Allocation
    • IJ always
    • If IJ fails to filter then to EJ
    • EJ coverage increases

Page 20: Snoop Filtering  and  Coarse-Grain Memory Tracking

Latency?
• Jetty may increase snoop-response time
• Can only be determined on a design by design basis
• Largest Jetty:
  – Five 32x32 bit register files

Page 21: Snoop Filtering  and  Coarse-Grain Memory Tracking

Results
• Used SPLASH-II
  – Scientific applications
  – “Large” Datasets
    • e.g., 4-80 MB of main memory allocated
    • Access Counts: 60M-1.7B
  – 4-way SMP, MOESI
  – 1MB direct-mapped L2, 64-byte blocks with 32-byte subblocks
  – 32KB direct-mapped L1, 32-byte blocks
• Coverage & Power (analytical model)

Page 22: Snoop Filtering  and  Coarse-Grain Memory Tracking

Coverage: Hybrid-Jetty

• Can capture 74% of all snoop-misses

[Figure: coverage (0% to 100%, higher is better) per benchmark (ba, ch, em, ff, fm, lu, oc, ra, rt, un) and AVG, for three configurations: 10x4x7 + 32x4, 9x4x7 + 32x4, 8x4x7 + 32x4]

Page 23: Snoop Filtering  and  Coarse-Grain Memory Tracking

Power-Savings

• 28% of overall L2 power

[Figure: power savings (0% to 50%, higher is better) per benchmark (ba, ch, em, ff, fm, oc, ra, rt, un) and AVG]

Page 24: Snoop Filtering  and  Coarse-Grain Memory Tracking

Summary
• Power is becoming important
  – Performance, Reliability and Feasibility
• Unique Opportunities Exist for Servers
• JETTY: Filter Snoops that would miss
  – 74% of all snoop-misses
  – 28% of L2 power saved
  – No protocol changes
  – No performance loss

Page 25: Snoop Filtering  and  Coarse-Grain Memory Tracking

Power efficient cache coherence

C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.

Page 26: Snoop Filtering  and  Coarse-Grain Memory Tracking

Serial Snooping
• Avoids speculative transmission of snoop packets.
• Check the nearest neighbor
• Data supplied with minimum latency and power

[Figure: nodes and MEMORY]
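
A rough sketch of the serial idea, assuming nodes are probed one at a time from nearest to farthest and that memory supplies the data if no node holds the block. The function name `serial_snoop` and the neighbor-list representation are hypothetical and only illustrate the ordering, not the protocol details of the paper.

```python
def serial_snoop(block, neighbors):
    """Probe neighbors one at a time, nearest first.

    neighbors: list of (name, cached_blocks) pairs ordered by distance.
    Returns (supplier, number_of_probes).
    """
    probes = 0
    for name, cached_blocks in neighbors:
        probes += 1
        if block in cached_blocks:
            # The first node that holds the block supplies it; the remaining
            # nodes are never probed, which is where the power is saved.
            return name, probes
    # No cache holds the block, so main memory supplies it.
    return "memory", probes
```

With three neighbors where the second one holds the block, only two tag lookups happen instead of three.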

Page 27: Snoop Filtering  and  Coarse-Grain Memory Tracking

TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors

Magnus Ekman, *Fredrik Dahlgren, and Per Stenström

Chalmers University of Technology
*Ericsson Mobile Platforms

Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002

Page 28: Snoop Filtering  and  Coarse-Grain Memory Tracking

Page Sharing Tables
• On a snoop, the requesting node gets a page-level sharing vector
• A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
• If a PST entry is evicted, the whole page must be evicted

Page 29: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence

Andreas Moshovos

Int’l Symposium on Computer Architecture (ISCA), 2005

Page 30: Snoop Filtering  and  Coarse-Grain Memory Tracking

Improving Snoop Coherence

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

• Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
• Can we: (1) Reduce Power/bandwidth (2) Leverage snoop coherence?
  – Remains Attractive: Simple / Design Re-use
• Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping

Page 31: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout: Avoid Some Snoops

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

• Frequent case: non-sharing even at a coarse level/Region
• RegionScout: Dynamically Identify Non-Shared Regions
  – First Request to a Region Identifies it as not Shared
  – Subsequent Requests do not need to be broadcast
• Uses Imprecise Information
  – Small structures
  – Layer on top of conventional coherence
  – No additional constraints

Page 32: Snoop Filtering  and  Coarse-Grain Memory Tracking

Roadmap
• Conventional Coherence:
  – The need for power-aware designs

• Potential: Program Behavior

• RegionScout: What and How

• Implementation

• Evaluation

• Summary

Page 33: Snoop Filtering  and  Coarse-Grain Memory Tracking

Coherence Basics

• Given request for memory block X (address)
• Detect where its current value resides

[Figure: three CPUs and Main Memory; a request for X is snooped by the other two CPUs and hits in one of them]

Page 34: Snoop Filtering  and  Coarse-Grain Memory Tracking

Conventional Coherence not Power-Aware/Bandwidth-Effective

All L2 tags see all accesses
• Perf. & Complexity: Have L2 tags, why not use them
• Power: All L2 tags consume power on all accesses
• Bandwidth: broadcast all coherent requests

[Figure: three CPUs with L2 caches and Main Memory; the snoops miss in both remote L2s]

Page 35: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout Motivation: Sharing is Coarse

• Region: large contiguous memory area, power of 2 size
• CPU X asks for data block in region R
  1. No one else has X
  2. No one else has any block in R

RegionScout Exploits this Behavior
Layered Extension over Snoop Coherence

[Figure: typical memory space snapshot; addresses colored by owner(s)]

Page 36: Snoop Filtering  and  Coarse-Grain Memory Tracking

Optimization Opportunities

• Power and Bandwidth
  – Originating node: avoid asking others
  – Remote node: avoid tag lookup

[Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]

Page 37: Snoop Filtering  and  Coarse-Grain Memory Tracking

Potential: Region Miss Frequency

[Figure: Global Region Misses as % of all requests (0% to 100%, higher is better) vs. Region Size (256, 512, 1K, 2K, 4K, 8K, 16K) for p4.512K, p4.1M, p8.512K, p8.1M]

Even with a 16K Region, ~45% of requests miss in all remote nodes

Page 38: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout at Work: Non-Shared Region Discovery

First request detects a non-shared region

[Figure: three CPUs and Main Memory, steps numbered 1 to 3; the two remote CPUs each report a Region Miss, which results in a Global Region Miss. Structures shown: Record: Non-Shared Regions; Record: Locally Cached Regions]

Page 39: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout at Work: Avoiding Snoops

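Subsequent request avoids snoops

[Figure: three CPUs and Main Memory, steps numbered 1 and 2; after the earlier Global Region Miss the request is not broadcast. Structures shown: Record: Non-Shared Regions; Record: Locally Cached Regions]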

Page 40: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout is Self-Correcting

Request from another node invalidates non-shared record

[Figure: three CPUs and Main Memory, steps numbered 1 and 2. Structures shown: Record: Non-Shared Regions; Record: Locally Cached Regions]

Page 41: Snoop Filtering  and  Coarse-Grain Memory Tracking

Implementation: Requirements

• Requesting Node provides address:
  – address = Region Tag | offset, where the offset is lg(Region Size) bits wide
• At Originating Node, from CPU:
  – Have I discovered that this region is not shared?
• At Remote Nodes, from Interconnect:
  – Do I have a block in the region?
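
The address split on this slide can be written out as a small helper. The function name `split_address` and the 16K default region size are assumptions for illustration; the only thing taken from the slide is that the offset occupies lg(Region Size) bits and the rest is the Region Tag.

```python
def split_address(addr, region_size=16 * 1024):
    """Split an address into (region_tag, offset) for a power-of-two region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    offset_bits = region_size.bit_length() - 1  # lg(Region Size)
    return addr >> offset_bits, addr & (region_size - 1)


# Example: 0x12345 with 16K regions lies in region 4 at offset 0x2345.
tag, offset = split_address(0x12345)
```

Two addresses belong to the same region exactly when their region tags are equal, which is the comparison both the originating and the remote nodes need.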

Page 42: Snoop Filtering  and  Coarse-Grain Memory Tracking

Remembering Non-Shared Regions

• Records non-shared regions
• Lookup by Region portion prior to issuing a request
• Snoop requests and invalidate

Non-Shared Region Table
• Indexed by the Region Tag portion of the address; each entry has a valid bit
• Few entries: 16x4 in most experiments
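
A minimal sketch of such a table, assuming "16x4" means 16 sets of 4 ways and assuming oldest-first replacement within a set. The class name `NonSharedRegionTable` and its method names are hypothetical.

```python
class NonSharedRegionTable:
    """Hypothetical set-associative table of regions known to be non-shared."""

    def __init__(self, sets=16, ways=4, region_size=16 * 1024):
        self.sets = sets
        self.ways = ways
        self.region_bits = region_size.bit_length() - 1
        self.table = [[] for _ in range(sets)]  # per set: region tags, oldest first

    def _locate(self, addr):
        region = addr >> self.region_bits
        return region, self.table[region % self.sets]

    def is_non_shared(self, addr):
        # Checked before issuing a request: a hit means no broadcast is needed.
        region, entries = self._locate(addr)
        return region in entries

    def record_non_shared(self, addr):
        # Called after a request came back as a global region miss.
        region, entries = self._locate(addr)
        if region in entries:
            return
        if len(entries) == self.ways:
            entries.pop(0)  # assumed replacement policy: drop the oldest entry
        entries.append(region)

    def on_remote_request(self, addr):
        # Another node touched the region, so it may now be shared: invalidate.
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)
```

Losing an entry through replacement only costs an extra broadcast later; correctness depends solely on invalidating entries when a remote request to the region is observed.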

Page 43: Snoop Filtering  and  Coarse-Grain Memory Tracking

What Regions are Locally Cached?

• If we had as many counters as regions:
  – Block Allocation: counter[region]++
  – Block Eviction: counter[region]--
  – Region cached only if counter[region] non-zero
• Not Practical:
  – E.g., 16K Regions and 4G Memory → 256K counters (2^32 / 2^14 = 2^18)

[Figure: the Region Tag portion of the address selects a counter]

Page 44: Snoop Filtering  and  Coarse-Grain Memory Tracking

What Regions are Locally Cached?

• Use few Counters
• Imprecise:
  – Records a superset of locally cached Regions
  – False positives: lost opportunity, correctness preserved

Cached Region Hash
• The Region Tag portion of the address is hashed to select a counter
• “Counter”: + on block allocation, - on block eviction
• Few entries, e.g., 256
• P-bit: 1 if counter non-zero; the p bits are used for lookups
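
The counter-plus-p-bit structure can be sketched as follows. The class name `CachedRegionHash`, the modulo hash, and the default sizes are assumptions for illustration; the p-bit is modeled simply as "counter is non-zero".

```python
class CachedRegionHash:
    """Hypothetical hashed counters recording a superset of locally cached regions."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.entries = entries
        self.region_bits = region_size.bit_length() - 1
        self.counters = [0] * entries

    def _index(self, addr):
        region = addr >> self.region_bits
        return region % self.entries  # assumed hash: low bits of the region tag

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_cache_region(self, addr):
        # Equivalent of reading the p-bit: set whenever the counter is non-zero.
        return self.counters[self._index(addr)] > 0
```

Two regions that hash to the same counter produce a false positive ("may be cached" for a region that is not), which only costs a missed filtering opportunity; a region with cached blocks always has a non-zero counter.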

Page 45: Snoop Filtering  and  Coarse-Grain Memory Tracking

Roadmap
• Conventional Coherence

• Program Behavior: Region Miss Frequency

• RegionScout

• Evaluation

• Summary

Page 46: Snoop Filtering  and  Coarse-Grain Memory Tracking

Evaluation Overview
• Methodology
• Filter rates
  – Practical Filters can capture many Region Misses
• Interconnect bandwidth reduction

Page 47: Snoop Filtering  and  Coarse-Grain Memory Tracking

Methodology
• In-House simulator based on Simplescalar
  – Execution driven
  – All instructions simulated, MIPS like ISA
  – System calls faked by passing them to host OS
  – Synchronization using load-linked/store-conditional
  – Simple in-order processors
  – Memory requests complete instantaneously
  – MESI snoop coherence
  – 1 or 2 level memory hierarchy
  – WATTCH power models
• SPLASH II benchmarks
  – Scientific workloads
  – Feasibility study

Page 48: Snoop Filtering  and  Coarse-Grain Memory Tracking

Filter Rates

[Figure: Identified Global Region Misses (0% to 100%, higher is better) vs. CRH Size (256, 512, 1K, 2K) for p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]

• For small CRH better to use large regions
• Practical RegionScout filters capture a lot of the potential

Page 49: Snoop Filtering  and  Coarse-Grain Memory Tracking

Bandwidth Reduction

[Figure: reduction in Messages (0% to 100%, higher is better) vs. Region Size (2K, 4K, 8K, 16K) for p4.512K, p8.512K, and the CMP configurations p4.64K, p8.64K]

• Moderate Bandwidth Savings for SMP (15%-22%)
• More so for CMP (>25%)

Page 50: Snoop Filtering  and  Coarse-Grain Memory Tracking

Related Work
• RegionScout
  – Technical Report, Dec. 2003
• Jetty
  – Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
• PST
  – Ekman, Dahlgren, and Stenström, ISLPED 2002
• Coarse-Grain Coherence
  – Cantin, Lipasti and Smith, ISCA 2005

Page 51: Snoop Filtering  and  Coarse-Grain Memory Tracking

Summary
• Exploit program behavior/optimize a frequent case
  – Many requests result in a global region miss
• RegionScout
  – Practical filter mechanism
  – Dynamically detect would-be region misses
  – Avoid broadcasts
  – Save tag lookup power and interconnect bandwidth
  – Small structures
  – Layered extension over existing mechanisms
  – Invisible to programmer and the OS

Page 52: Snoop Filtering  and  Coarse-Grain Memory Tracking

Coarse-Grain Coherence

J. Cantin, M. Lipasti and J. E. Smith
ISCA 2005

Page 53: Snoop Filtering  and  Coarse-Grain Memory Tracking

Coarse-Grain Coherence
• Exploits the same phenomenon as RegionScout
• Protocol extended to keep track of region state as well
  – Additional optimizations
• Uses an additional region tag array to do so
• Region replacements
  – Must scan to find the region’s blocks and evict them

Page 54: Snoop Filtering  and  Coarse-Grain Memory Tracking

Flexible snooping: adaptive forwarding and filtering of snoops in embedded-ring multiprocessors

K. Strauss, X. Shen, J. Torrellas

International Symposium on Computer Architecture, June 2006.

Page 55: Snoop Filtering  and  Coarse-Grain Memory Tracking

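Predictors and algorithms

Karin Strauss, Flexible Snooping

Each predictor / algorithm is defined by the set of addresses in the predictor (relative to what the node can supply), the action on positive prediction, and the action on negative prediction:
• Subset: positive → snoop; negative → forward then snoop
• Superset Con: positive → snoop then forward; negative → forward
• Superset Agg: positive → forward then snoop; negative → forward
• Exact: positive → snoop; negative → forward

Ring-specific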

Page 56: Snoop Filtering  and  Coarse-Grain Memory Tracking

Predictor implementation

• Subset
  – associative table: subset of addresses that can be supplied by node
• Superset
  – bloom filter: superset of addresses that can be supplied by node
  – associative table (exclude cache): addresses that recently suffered false positives
• Exact
  – associative table: all addresses that can be supplied by node
  – downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded

Page 57: Snoop Filtering  and  Coarse-Grain Memory Tracking

Design and Implementation of the Blue Gene/P Snoop Filter

Valentina Salapura, Matthias Blumrich, Alan Gara

Int’l Conf. on High-Performance Computer Architecture, 2008

Page 58: Snoop Filtering  and  Coarse-Grain Memory Tracking
Page 59: Snoop Filtering  and  Coarse-Grain Memory Tracking

Three Mechanisms
• Stream registers
  – Contiguous data areas
  – Adaptive to cache arbitrarily sized contiguous regions with a single register
  – Stream registers track strided and sequential streams
• Snoop caches
  – Cache of recently executed snoop requests
  – Multiple requests to same line do not have to cause multiple snoop lookups
  – Snoop caches track locality
• Range filter
  – Identify regions of known non-shared data
  – Configured by software

Page 60: Snoop Filtering  and  Coarse-Grain Memory Tracking

Stream Registers
• Base = where the block starts
• Mask = which bits are common
  – Example: base 0111, mask 1101 → 01X1 may be in the cache (X = either value)
• Over time Mask becomes all zeros
• How to reset?
• Cache Wrap
  – Each set uses Round-Robin replacement
  – Count replacements per set
  – Cache wrap when all counters > ways
  – Copy all streams to history and use combination
  – Next time throw out history
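
The base/mask behavior can be sketched as below. The class name `StreamRegister`, its method names, and the choice to clear every mask bit in which a new address differs from the base are assumptions that are consistent with the example on this slide but are not taken from the Blue Gene/P design itself.

```python
class StreamRegister:
    """Hypothetical base/mask register covering a superset of observed addresses."""

    def __init__(self, width=4):
        self.full = (1 << width) - 1
        self.width = width
        self.valid = False
        self.base = 0
        self.mask = self.full

    def update(self, addr):
        # Record an address loaded into the cache.
        if not self.valid:
            self.base, self.mask, self.valid = addr, self.full, True
        else:
            # Bits where addr differs from base become "don't care" (mask bit 0).
            self.mask &= ~(addr ^ self.base) & self.full

    def may_contain(self, addr):
        # A snoop can be filtered only if it differs from base in a masked bit.
        return self.valid and ((addr ^ self.base) & self.mask) == 0

    def describe(self):
        bits = format(self.mask, "0{}b".format(self.width)).replace("0", "X")
        return "{} / {}".format(format(self.base, "0{}b".format(self.width)), bits)
```

Since mask bits are only ever cleared, the register covers more and more addresses over time, which is why some reset mechanism such as the cache wrap is needed.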

Page 61: Snoop Filtering  and  Coarse-Grain Memory Tracking

Stream Registers: An Example
• Direct mapped cache with two blocks

Time  Cache         Stream registers (base / mask)
t0    empty, empty  empty, empty
t1    001, empty    001 / 111, empty
t2    001, 011      001 / 1X1, empty
t3    101, 011      001 / 1X1, 101 / 111
t4    101, 111      001 / 1X1, 101 / 1X1

• At this point the filter reports that the cache contains:
  – 001 and 011
  – 101 and 111
• The first two are not there
• Eventually the filter becomes saturated and cannot filter much
• How can we get rid of the 011 / 1X1?

Page 62: Snoop Filtering  and  Coarse-Grain Memory Tracking

Avoiding Saturation: Exploiting Cache Wrapping

Time  Cache         Stream registers       Shadow
t0    empty, empty  empty, empty           empty, empty
t1    001, empty    001 / 111, empty       empty, empty
t2    001, 011      001 / 1X1, empty       001 / 1X1, empty
t3    101, 011      empty, 101 / 111       001 / 1X1, empty
t4    101, 111      empty, 101 / 1X1       001 / 1X1, empty

Cache Wrap: can discard Shadow

Page 63: Snoop Filtering  and  Coarse-Grain Memory Tracking

Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Chinnakrishnan S. Ballapuram
Ahmad Sharif
Hsien-Hsin S. Lee

ASPLOS 2008

Page 64: Snoop Filtering  and  Coarse-Grain Memory Tracking

Software-Hardware Hybrid
• Software directs hardware what to do
  – Mechanisms very similar to Jetty and RegionScout
• Paper incorrectly states that:
  – Jetty does not work for CMPs
    • It does not work well for small scale CMPs
  – RegionScout works only for busses
    • Is interconnect agnostic

Page 65: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy

Jason Zebchuk, Elham Safi and Andreas Moshovos

Int’l Symposium on Microarchitecture, 2007

Page 66: Snoop Filtering  and  Coarse-Grain Memory Tracking

EPFL, Jan. 2008
Aenao Group/Toronto

Future Caches: Just Larger?

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory; label: 10s – 100s of MB]

1. “Big Picture” Management
2. Store Metadata

Page 67: Snoop Filtering  and  Coarse-Grain Memory Tracking

Conventional Block Centric Cache

• “Small” Blocks
  – Optimizes Bandwidth and Performance
• Large L2/L3 caches especially

[Figure: Fine-Grain View of Memory in the L2 Cache; Big Picture Lost]

Page 68: Snoop Filtering  and  Coarse-Grain Memory Tracking

“Big Picture” View

• Region: 2^n sized, aligned area of memory
• Patterns and behavior exposed
  – Spatial locality
• Exploit for performance/area/power

[Figure: Coarse-Grain View of Memory in the L2 Cache]

Page 69: Snoop Filtering  and  Coarse-Grain Memory Tracking

Exploiting Coarse-Grain Patterns

• Many existing coarse-grain optimizations
  – Stealth Prefetching
  – Run-time Adaptive Cache Hierarchy Management via Reference Analysis
  – Destination-Set Prediction
  – Spatial Memory Streaming
  – Coarse-Grain Coherence Tracking
  – RegionScout
  – Circuit-Switched Coherence
  – Virtual Tree Coherence
  – Power-Efficient DRAM Speculation
• Add new structures to track coarse-grain information
  – Hard to justify for a commercial design
• Coarse-Grain Framework
  – Embed coarse-grain information in tag array
  – Support many different optimizations with less area overhead
  – Adaptable optimization FRAMEWORK

[Figure: CPU and L2 Cache]

Page 70: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker Solution

Manage blocks, but also track and manage regions

[Figure: four L1 caches exchange Block Requests and Data Blocks with the L2 Cache (Tag Array / RegionTracker and Data Array); RegionTracker also receives Region Probes and returns Region Responses]

Page 71: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker Summary
• Replace conventional tag array:
  – 4-core CMP with 8MB shared L2 cache
  – Within 1% of original performance
  – Up to 20% less tag area
  – Average 33% less energy consumption
• Optimization Framework:
  – Stealth Prefetching: same performance, 36% less area
  – RegionScout: 2x more snoops avoided, no area overhead

Page 72: Snoop Filtering  and  Coarse-Grain Memory Tracking


Road Map

• Introduction

• Goals

• Coarse-Grain Cache Designs

• RegionTracker: A Tag Array Replacement

• RegionTracker: An Optimization Framework

• Conclusion

Page 73: Snoop Filtering  and  Coarse-Grain Memory Tracking

Goals
1. Conventional Tag Array Functionality
  – Identify data block location and state
  – Leave data array unchanged
2. Optimization Framework Functionality
  – Is Region X cached?
  – Which blocks of Region X are cached? Where?
  – Evict or migrate Region X
  – Easy to assign properties to each Region

Page 74: Snoop Filtering  and  Coarse-Grain Memory Tracking

Coarse-Grain Cache Designs

Large Block Size
• Increased BW, Decreased hit-rates

[Figure: Tag Array and Data Array holding Region X]

Page 75: Snoop Filtering  and  Coarse-Grain Memory Tracking

Sector Cache

• Decreased hit-rates

[Figure: Tag Array and Data Array holding Region X]

Page 76: Snoop Filtering  and  Coarse-Grain Memory Tracking

Sector Pool Cache

• High Associativity (2 - 4 times)

[Figure: Tag Array and Data Array holding Region X]

Page 77: Snoop Filtering  and  Coarse-Grain Memory Tracking

Decoupled Sector Cache

• Region information not exposed
• Region replacement requires scanning multiple entries

[Figure: Tag Array, Status Table, and Data Array holding Region X]

Page 78: Snoop Filtering  and  Coarse-Grain Memory Tracking

Design Requirements
• Small block size (64B)
• Miss-rate does not increase
• Lookup associativity does not increase
• No additional access latency
  – (i.e., No scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fit in conventional tag array “envelope”

Page 79: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker: A Tag Array Replacement

• 3 SRAM arrays, combined smaller than tag array
  – Region Vector Array
  – Block Status Table
  – Evicted Region Buffer

[Figure: four L1 caches above the three arrays and the Data Array]

Page 80: Snoop Filtering  and  Coarse-Grain Memory Tracking

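Common Case: Hit

Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region

Address fields (boundaries at bits 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset

[Figure: the RVA Index selects a set of the Region Vector Array (RVA); each entry holds a Region Tag and, for block0 through block15, a valid bit (V) and a way. The result is combined into the Data Array + BST Index (boundaries at bits 19, 6, 0, above the Block Offset), which selects the status in the Block Status Table (BST) and goes To Data Array]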
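
The address fields for this example can be worked through in code. The field boundaries used here (Block Offset in bits 0 to 5, Region Offset in bits 6 to 9, RVA Index in bits 10 to 20, Region Tag from bit 21 up) are an assumption read off the bit-position labels on the slide, and the names `split` and `data_array_set` are hypothetical.

```python
from collections import namedtuple

Fields = namedtuple("Fields", "region_tag rva_index region_offset block_offset")

BLOCK_BITS = 6       # 64-byte blocks
REGION_BITS = 10     # 1KB regions, so 16 blocks per region
RVA_INDEX_BITS = 11  # assumed: bits 10..20 select the RVA set


def split(addr):
    """Split an address into the fields shown on the slide."""
    block_offset = addr & ((1 << BLOCK_BITS) - 1)
    region_offset = (addr >> BLOCK_BITS) & ((1 << (REGION_BITS - BLOCK_BITS)) - 1)
    rva_index = (addr >> REGION_BITS) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> (REGION_BITS + RVA_INDEX_BITS)
    return Fields(region_tag, rva_index, region_offset, block_offset)


def data_array_set(addr, cache_bytes=8 << 20, ways=16):
    """Set index of the data array: 8MB / (16 ways * 64B) = 8192 sets (13 bits)."""
    sets = cache_bytes // (ways << BLOCK_BITS)
    return (addr >> BLOCK_BITS) % sets
```

The 4-bit region offset selects one of the 16 per-block entries (block0 to block15) inside the matching RVA entry, and the 13-bit set index together with the stored way locates the block in the data array.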

Page 81: Snoop Filtering  and  Coarse-Grain Memory Tracking

Worst Case (Rare): Region Miss

Address fields (boundaries at bits 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset

[Figure: the lookup in the Region Vector Array (RVA) finds No Match! for the Region Tag; the Evicted Region Buffer (ERB) and the Block Status Table (BST), indexed by the Data Array + BST Index (boundaries at bits 19, 6, 0, above the Block Offset), hold status and Ptr fields]

Page 82: Snoop Filtering  and  Coarse-Grain Memory Tracking

Methodology
• Flexus simulator from CMU SimFlex group
  – Based on Simics full-system simulator
• 4-core CMP modeled after Piranha
  – Private 32KB, 4-way set-associative L1 caches
  – Shared 8MB, 16-way set-associative L2 cache
  – 64-byte blocks
• Miss-rates: Functional simulation of 2 billion instructions per core
• Performance and Energy: Timing simulation using SMARTS sampling methodology
• Area and Power: Full custom implementation on 130nm commercial technology
• 9 commercial workloads:
  – WEB: SpecWEB on Apache and Zeus
  – OLTP: TPC-C on DB2 and Oracle
  – DSS: 5 TPC-H queries on DB2

[Figure: four processors (P), each with D$ and I$, connected through an Interconnect to the L2]

Page 83: Snoop Filtering  and  Coarse-Grain Memory Tracking

Miss-Rates vs. Area

• Sector Cache: 512KB sectors, SPC and RT: 1KB regions
• Trade-offs comparable to conventional cache

[Figure: Relative Miss-Rate (0.99 to 1.05, lower is better) vs. Relative Tag Array Area (0.5 to 1.2) for Sector Pool Cache, RegionTracker, and Conventional Tags; points labeled 14-way, 15-way, 48-way, 52-way; Sector Cache lies off the chart at (0.25, 1.26)]

Page 84: Snoop Filtering  and  Coarse-Grain Memory Tracking

Performance & Energy

• 12-way set-associative RegionTracker: 20% less area
• Error bars: 95% confidence interval
• Performance within 1%, with 33% tag energy reduction

[Figure, Performance: Normalized Execution Time (0.97 to 1.03) for WEB, OLTP, DSS]
[Figure, Energy: Reduction in Tag Energy (0% to 50%) for WEB, OLTP, DSS]

Page 85: Snoop Filtering  and  Coarse-Grain Memory Tracking


Road Map

• Introduction

• Goals

• Coarse-Grain Cache Designs

• RegionTracker: A Tag Array Replacement

• RegionTracker: An Optimization Framework

• Conclusion

Page 86: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker: An Optimization Framework

[Figure: four L1 caches above the RVA, ERB, BST, and Data Array]

• Stealth Prefetching:
  – Average 20% performance improvement
  – Drop-in RegionTracker for 36% less area overhead
• RegionScout:
  – In-depth analysis

Page 87: Snoop Filtering  and  Coarse-Grain Memory Tracking

Snoop Coherence: Common Case

[Figure: three CPUs and Main Memory; one CPU issues Read x, Read x+1, Read x+2, ..., Read x+n, and each snoop misses in both remote CPUs]

Many snoops are to non-shared regions

Page 88: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionScout

Eliminate broadcasts for non-shared regions

[Figure: three CPUs and Main Memory; Read x misses in the remote CPUs, which also report a Region Miss, giving a Global Region Miss. Structures shown: Non-Shared Regions; Locally Cached Regions]

Page 89: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker Implementation

• Minimal overhead to support RegionScout optimization
  – Non-Shared Regions: add 1 bit to each RVA entry
  – Locally Cached Regions: already provided by RVA
• Still uses less area than conventional tag array

Page 90: Snoop Filtering  and  Coarse-Grain Memory Tracking

RegionTracker + RegionScout

4 processors, 512KB L2 Caches, 1KB regions

[Figure: Reduction in Snoop Broadcasts (0% to 50%, higher is better, HMEAN) for RS 7KB, RS 12KB, RS 22KB, and RSRT]

Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array

Page 91: Snoop Filtering  and  Coarse-Grain Memory Tracking

Result Summary
• Replace Conventional Tag Array:
  – 20% Less tag area
  – 33% Less tag energy
  – Within 1% of original performance
• Coarse-Grain Optimization Framework:
  – 36% reduction in area overhead for Stealth Prefetching
  – Filter 41% of snoop broadcasts with no area overhead compared to conventional cache