Snoop Filtering and Coarse-Grain Memory Tracking
Andreas Moshovos, Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors
JETTY: Snoop-Filtering for Reduced Power in SMP Servers
Andreas Moshovos, ECE, Univ. of Toronto
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int'l Symposium on High-Performance Computer Architecture (HPCA), 2001
Power is Becoming Important
• Architecture is a science of tradeoffs
• Thus far: Performance vs. Cost vs. Complexity
• Today: ... vs. Power
• Where? Mobile devices; desktops/servers (our focus)
Power-Aware Servers
• Revisit the design of SMP servers
  – 2 or more CPUs per machine
  – Snoop-coherence based
• Why? File, web, databases: your typical workloads; cost-effective too
• This work, a first step: power-aware snoopy coherence
Power-Aware Snoop Coherence
• Conventional
  – All L2 caches snoop all memory traffic
  – Power is expended by all caches on any memory access
• Jetty-Enhanced
  – Tiny structure on the L2 backside
  – Filters most "would-be misses"
  – Less power expended on most snoop misses
  – No changes to the protocol necessary
  – No performance loss
Roadmap
• Why Power is a Concern for Servers
• Snoopy-Coherence Basics
• An Opportunity for Reducing Power
• JETTY
• Results
• Summary
Why is Power Important?
Power could ultimately limit performance.
• Power demands have been increasing
• Energy must be delivered to and on chip
• Heat must be dissipated
• Limits: amount of resources & frequency; feasibility
• Cooling is a solution, but at what cost and integration effort?
Reducing power demands is much more convenient.
What can be done?
• Redesign circuits
• Clock gating and frequency scaling
  – A lot has been done thus far; still an active area
• Rethink architectural decisions
  – Orthogonal to the others
Goal: reduce power under performance constraints

The "Silver Bullet" Solution
• Good if there was one; however, until one is found...
• Look at all structures
• Rethink their design
• Propose power-optimized versions
• This is what we already do for performance
Snoopy Cache Coherence
• All L2 tags see all bus accesses; a cache intervenes when necessary
[Diagram: CPU cores with private L1 and L2 caches on a shared bus to main memory; a remote L2 snoop hits.]
How About Power?
• All L2 tags see all bus accesses
• Performance & complexity: we have the L2 tags, why not use them?
• Power: all L2 tags consume power on all accesses
[Diagram: the same system; every remote L2 performs a tag lookup and misses.]
JETTY: A Would-Be Snoop-Miss Filter
• Imprecise: it may filter a would-be miss, but it never filters snoop hits
• Would-be snoop miss: JETTY answers "Not here!" and the L2 tag lookup is skipped
• Would-be snoop hit: JETTY answers "Don't know" and the snoop proceeds to the L2 tags
• Goal: detect most misses using fewer resources
Potential for Savings Exists
• Most snoops miss: 91% on average
• Many L2 accesses are due to snoop misses: 55% on average
• Sizeable potential power savings: 20%-50% of total L2 power
Exclude-Jetty
• Records a subset of what is not cached
• How? Cache recent snoop misses locally
[Diagram: Venn view of cached vs. not-cached blocks; the Exclude-Jetty covers part of the not-cached set.]
• In short: a subset of what you don't have
• Works well for producer-consumer sharing
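The idea above can be sketched as a small direct-mapped table of recently observed snoop misses; the 64-entry geometry and the interface names here are illustrative assumptions, not the paper's design:

```python
# Sketch of an Exclude-Jetty: a small direct-mapped table caching block
# addresses that recently missed in the local L2. (Geometry is illustrative.)
ENTRIES = 64

class ExcludeJetty:
    def __init__(self):
        self.tags = [None] * ENTRIES  # recently-missed block addresses

    def filter(self, addr):
        """True -> snoop is a known miss; the L2 tag lookup can be skipped."""
        return self.tags[addr % ENTRIES] == addr

    def record_miss(self, addr):
        """Called when a snoop actually missed in the local L2."""
        self.tags[addr % ENTRIES] = addr

    def invalidate(self, addr):
        """Called when the local cache allocates addr (it is now cached)."""
        idx = addr % ENTRIES
        if self.tags[idx] == addr:
            self.tags[idx] = None
```

Because the table only ever asserts "known miss", a stale or missing entry merely forfeits a filtering opportunity; correctness is preserved.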
Include-Jetty
• Records a superset of what is cached
• How? Hash the address with functions f( ), g( ), h( ); each selects one bit in its own bit vector (bit vector 0, 1, 2)
  – Not cached: any selected bit is zero
  – May be cached: all selected bits are set
[Diagram: an address feeding f( ), g( ), h( ), each indexing a separate bit vector.]
Later I was told this is a Bloom filter…
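The structure above can be sketched as follows; the vector size and the three hash mixes are illustrative assumptions, not the paper's functions:

```python
# Sketch of an Include-Jetty as a Bloom filter: three hash functions each
# set one bit in their own bit vector on every L2 block allocation.
# A snoop whose address finds any zero bit cannot be cached -> filter it.
BITS = 128

def f(a): return a % BITS
def g(a): return (a // BITS) % BITS
def h(a): return (a * 2654435761 >> 8) % BITS  # Knuth-style mix, illustrative

class IncludeJetty:
    def __init__(self):
        self.vec = [[0] * BITS for _ in range(3)]

    def allocate(self, addr):
        for v, fn in zip(self.vec, (f, g, h)):
            v[fn(addr)] = 1

    def may_be_cached(self, addr):
        # All three selected bits set -> "may be cached" (must snoop).
        # Any zero bit -> definitely not cached (safe to filter).
        return all(v[fn(addr)] for v, fn in zip(self.vec, (f, g, h)))
```

Note this sketch omits deletion: plain bit vectors cannot be decremented on eviction, which is exactly why the real design moves to counting (see the L-CBF note below).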
Include-Jetty (cont.)
• A superset of what you have
• This is a counting Bloom filter; see: L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, in Proc. Int'l Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006
• Partially overlapping indexes worked better
Hybrid-Jetty
• In some cases Exclude-Jetty works well; in others Include-Jetty is better
• Combine them:
  – Access both in parallel on a snoop
  – Allocation: into IJ always; if IJ fails to filter, then also into EJ
  – EJ coverage increases
Latency?
• Jetty may increase snoop-response time
• Can only be determined on a design-by-design basis
• Largest Jetty: five 32x32-bit register files
Results
• Used SPLASH-2
  – Scientific applications, "large" datasets
    • e.g., 4-80 MB of main memory allocated; access counts: 60M-1.7B
  – 4-way SMP, MOESI
  – 1MB direct-mapped L2, 64B blocks with 32B subblocks
  – 32KB direct-mapped L1, 32B blocks
• Coverage & power (analytical model)
Coverage: Hybrid-Jetty
• Can capture 74% of all snoop misses
[Chart: snoop-miss coverage (0%-100%, higher is better) per benchmark (ba, ch, em, ff, fm, lu, oc, ra, rt, un) and on average, for three configurations: 10x4x7 + 32x4, 9x4x7 + 32x4, 8x4x7 + 32x4.]
Power Savings
• 28% of overall L2 power
[Chart: L2 power savings (0%-50%, higher is better) per benchmark (ba, ch, em, ff, fm, oc, ra, rt, un) and on average.]
Summary
• Power is becoming important
  – For performance, reliability, and feasibility
• Unique opportunities exist for servers
• JETTY: filter snoops that would miss
  – 74% of all snoops filtered
  – 28% of L2 power saved
  – No protocol changes
  – No performance loss
Power-Efficient Cache Coherence
C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001

Serial Snooping
• Avoids speculative transmission of snoop packets
• Check the nearest neighbor first
• Data supplied with minimum latency and power
TLB and Snoop Energy Reduction using Virtual Caches in Low-Power Chip-Multiprocessors
Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology, *Ericsson Mobile Platforms
Int'l Symposium on Low Power Electronics and Design, Aug. 2002

Page Sharing Tables
• On a snoop, the requesting node gets a page-level sharing vector
• If a PST entry is evicted, the whole page must be evicted
• A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
RegionScout: Exploiting Coarse-Grain Sharing in Snoop Coherence
Andreas Moshovos, [email protected]
Int'l Symposium on Computer Architecture (ISCA), 2005
[Diagram: CPUs with I$ and D$ connected through an interconnect to main memory.]
Improving Snoop Coherence
• Conventional considerations: complexity and correctness, NOT power/bandwidth
• Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence? Snooping remains attractive: simple, and enables design re-use
• Yes: exploit program behavior to dynamically identify requests that do not need snooping
RegionScout: Avoid Some Snoops
• Frequent case: non-sharing, even at a coarse level (a Region)
• RegionScout: dynamically identify non-shared regions
  – The first request to a region identifies it as not shared
  – Subsequent requests do not need to be broadcast
• Uses imprecise information
  – Small structures
  – A layer on top of conventional coherence
  – No additional constraints
Roadmap
• Conventional Coherence: the need for power-aware designs
• Potential: Program Behavior
• RegionScout: What and How
• Implementation
• Evaluation
• Summary
Coherence Basics
• Given a request for memory block X (an address), detect where its current value resides
[Diagram: all CPUs snoop the request for X; one cache hits and supplies the value.]
Conventional Coherence is not Power-Aware or Bandwidth-Effective
• All L2 tags see all accesses
• Performance & complexity: we have the L2 tags, why not use them?
• Power: all L2 tags consume power on all accesses
• Bandwidth: all coherent requests are broadcast
[Diagram: every remote L2 looks up its tags and misses.]
RegionScout Motivation: Sharing is Coarse
• Region: a large, contiguous, aligned memory area of power-of-2 size
• When CPU X asks for a data block in region R, frequently:
  1. No one else has X
  2. No one else has any block in R
• RegionScout exploits this behavior as a layered extension over snoop coherence
[Figure: a typical memory-space snapshot, addresses colored by owner(s).]
Optimization Opportunities
• Power and bandwidth
  – Originating node: avoid asking the others
  – Remote nodes: avoid the tag lookup
[Diagram: CPUs with I$ and D$ around a switch connected to memory.]
Potential: Region Miss Frequency
[Chart: global region misses as a fraction of all requests (0%-100%, higher is better) vs. region size (256B to 16KB), for 4- and 8-processor systems with 512KB or 1MB L2 caches (p4.512K, p4.1M, p8.512K, p8.1M).]
Even with a 16KB region, ~45% of requests miss in all remote nodes.
RegionScout at Work: Non-Shared Region Discovery
The first request detects a non-shared region:
1. The originating node broadcasts its request
2. Each remote node checks its record of locally cached regions and signals a region miss
3. Seeing a global region miss, the originating node adds the region to its record of non-shared regions
RegionScout at Work: Avoiding Snoops
A subsequent request avoids snoops:
1. The originating node finds the region in its record of non-shared regions
2. The request goes directly to memory: no broadcast, no remote tag lookups
RegionScout is Self-Correcting
A request from another node invalidates the non-shared record: nodes snoop all requests, and any node holding a non-shared entry for the requested region drops that entry.
Implementation: Requirements
• The requesting node provides the address; its Region is the high-order part: address = Region Tag | offset, where the offset is lg(Region Size) bits
• At the originating node (request from the CPU): have I discovered that this region is not shared?
• At remote nodes (request from the interconnect): do I have a block in the region?
Remembering Non-Shared Regions: the Non-Shared Region Table (NSRT)
• Records non-shared regions: region tag plus valid bit
• Looked up by the Region portion of the address prior to issuing a request
• Snoops other nodes' requests and invalidates matching entries
• Few entries: 16 sets x 4 ways in most experiments
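The table just described can be sketched as a tiny set-associative structure; the 16x4 geometry follows the slide, while the per-set FIFO replacement and the 16KB region size are illustrative assumptions:

```python
# Sketch of a Non-Shared Region Table (NSRT): a small set-associative
# cache of region tags this node has discovered to be non-shared.
from collections import deque

SETS, WAYS = 16, 4
REGION_BITS = 14  # assuming 16KB regions

class NSRT:
    def __init__(self):
        # deque(maxlen=WAYS) gives simple FIFO replacement per set.
        self.sets = [deque(maxlen=WAYS) for _ in range(SETS)]

    def _loc(self, addr):
        region = addr >> REGION_BITS
        return self.sets[region % SETS], region

    def is_non_shared(self, addr):
        s, region = self._loc(addr)
        return region in s

    def record_non_shared(self, addr):
        s, region = self._loc(addr)
        if region not in s:
            s.append(region)  # oldest entry falls out when the set is full

    def snoop_invalidate(self, addr):
        # Another node touched this region: it may be shared now.
        s, region = self._loc(addr)
        if region in s:
            s.remove(region)
```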
What Regions are Locally Cached?
• If we had as many counters as regions:
  – Block allocation: counter[region]++
  – Block eviction: counter[region]--
  – A region is cached only if counter[region] is non-zero
• Not practical: e.g., 16KB regions and 4GB of memory require 256K counters
What Regions are Locally Cached? The Cached Region Hash (CRH) (Moshovos ©)
• Use few counters, indexed by a hash of the Region portion of the address; imprecise:
  – Records a superset of the locally cached regions
  – False positives: a lost opportunity, but correctness is preserved
• Counter: ++ on block allocation, -- on block eviction; few entries, e.g., 256
• A p-bit per entry (1 if the counter is non-zero) is used for lookups
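The counter scheme above can be sketched directly; the 256-entry size follows the slide, while the modulo hash and 16KB region size are illustrative assumptions:

```python
# Sketch of the Cached Region Hash (CRH): a small array of counters
# indexed by a hash of the region tag.
CRH_ENTRIES = 256
REGION_BITS = 14  # assuming 16KB regions -> drop low 14 address bits

class CachedRegionHash:
    def __init__(self):
        self.count = [0] * CRH_ENTRIES

    def _idx(self, addr):
        return (addr >> REGION_BITS) % CRH_ENTRIES  # illustrative hash

    def on_block_alloc(self, addr):
        self.count[self._idx(addr)] += 1

    def on_block_evict(self, addr):
        self.count[self._idx(addr)] -= 1

    def region_may_be_cached(self, addr):
        # The p-bit: a non-zero counter means the region *may* be cached
        # here (false positives possible, never false negatives).
        return self.count[self._idx(addr)] != 0
```

Hash aliasing means two regions can share a counter, which is exactly the "superset" imprecision: a remote node may snoop needlessly, but it never skips a snoop it needed.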
Roadmap
• Conventional Coherence
• Program Behavior: Region Miss Frequency
• RegionScout
• Evaluation
• Summary
Evaluation Overview
• Methodology
• Filter rates: practical filters can capture many region misses
• Interconnect bandwidth reduction
Methodology
• In-house simulator based on Simplescalar
  – Execution-driven; all instructions simulated; MIPS-like ISA
  – System calls faked by passing them to the host OS
  – Synchronization via load-linked/store-conditional
  – Simple in-order processors; memory requests complete instantaneously
  – MESI snoop coherence; 1- or 2-level memory hierarchy
  – WATTCH power models
• SPLASH-2 benchmarks
  – Scientific workloads
  – A feasibility study
Filter Rates
[Chart: identified global region misses (0%-100%, higher is better) vs. CRH size (256 to 2K entries), for 4- and 8-processor systems with 512KB caches and 4KB or 16KB regions (p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K).]
For a small CRH it is better to use large regions. Practical RegionScout filters capture much of the potential.
Bandwidth Reduction
[Chart: reduction in messages (0%-100%, higher is better) vs. region size (2KB to 16KB), for SMP (p4.512K, p8.512K) and CMP (p4.64K, p8.64K) configurations.]
Moderate bandwidth savings for SMP (15%-22%); more so for CMP (>25%).
Related Work
• RegionScout: Technical Report, Dec. 2003
• Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
• PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
• Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005
Summary
• Exploit program behavior; optimize a frequent case
  – Many requests result in a global region miss
• RegionScout
  – A practical filter mechanism
  – Dynamically detects would-be region misses
  – Avoids broadcasts; saves tag-lookup power and interconnect bandwidth
  – Small structures; a layered extension over existing mechanisms
  – Invisible to the programmer and the OS
Coarse-Grain Coherence
J. Cantin, M. Lipasti and J. E. Smith, ISCA 2005
• Exploits the same phenomenon as RegionScout
• The protocol is extended to also keep track of region state
  – Enables additional optimizations
• Uses an additional region tag array to do so
• Region replacements: must scan the cache, find the region's blocks, and evict them
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture, June 2006
Predictors and Algorithms
The (ring-specific) predictor at each node holds a set of addresses the node may be able to supply.

predictor / algorithm     action on positive prediction   action on negative prediction
Subset                    snoop                           forward then snoop
Superset Con(servative)   snoop then forward              forward
Superset Agg(ressive)     forward then snoop              forward
Exact                     snoop                           forward
Predictor Implementation
• Subset
  – Associative table: a subset of the addresses that can be supplied by the node
• Superset
  – Bloom filter: a superset of the addresses that can be supplied by the node
  – Associative table (exclude cache): addresses that recently suffered false positives
• Exact
  – Associative table: all addresses that can be supplied by the node
  – Downgrading: if an address has to be evicted from the predictor table, the corresponding line in the node has to be downgraded
Design and Implementation of the Blue Gene/P Snoop Filter
Valentina Salapura, Matthias Blumrich, Alan Gara
Int’l Conf. on High-Performance Computer Architecture, 2008
Three Mechanisms
• Stream registers
  – Track contiguous data areas
  – Adaptive: cache arbitrarily sized contiguous regions with a single register
  – Track strided and sequential streams
• Snoop caches
  – A cache of recently executed snoop requests
  – Multiple requests to the same line do not have to cause multiple snoop lookups
  – Track locality
• Range filter
  – Identifies regions of known non-shared data
  – Configured by software
Stream Registers
• Base = where the block stream starts; Mask = which address bits are common
  – Example: base 0111, mask 1101: any address matching 01X1 may be in the cache
• Over time the Mask decays toward all zeros; how to reset?
• Cache wrap
  – Each set uses round-robin replacement; count replacements per set
  – A cache wrap has occurred when all counters exceed the number of ways
  – On a wrap, copy all stream registers to a history set and use the combination
  – On the next wrap, throw out the history
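The base/mask update rule above can be sketched as follows; the tiny 4-bit address width matches the slide's example, and the class interface is an illustrative assumption:

```python
# Sketch of a Blue Gene/P-style stream register: (base, mask) summarizes a
# set of cached line addresses. A mask bit stays 1 only while every recorded
# address agrees with the base in that bit position.
ADDR_BITS = 4  # tiny addresses, as in the slide's example

class StreamRegister:
    def __init__(self, first_addr):
        self.base = first_addr
        self.mask = (1 << ADDR_BITS) - 1  # all bits significant at first

    def add(self, addr):
        # Clear mask bits wherever the new address disagrees with the base.
        self.mask &= ~(self.base ^ addr)

    def may_contain(self, addr):
        # Snoop check: addr matches the base in all still-significant bits.
        return (addr & self.mask) == (self.base & self.mask)
```

With base 0111 and a second address 0101, the mask becomes 1101, so the register matches the pattern 01X1, exactly the slide's example.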
Stream Registers: An Example
• Direct-mapped cache with two blocks, tracked by stream registers (base / mask)
[Worked timeline: as blocks 001, 011, 101, 111 are allocated and replaced, the stream registers accumulate to 001 / 1X1 and 101 / 1X1.]
• At this point the filter reports that the cache contains 001 and 011, plus 101 and 111; the first two are no longer there
• Eventually the filter becomes saturated and can no longer filter much
• How can we get rid of the stale 001 / 1X1 entry?
Avoiding Saturation: Exploiting Cache Wraps
[Worked timeline: the same allocation sequence, but on a cache wrap the current stream register (001 / 1X1) is copied to a shadow set and the active registers restart empty; subsequent allocations build 101 / 111 and then 101 / 1X1. Once the cache wraps again, the shadow (001 / 1X1) can be discarded.]
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
ASPLOS 2008

Software-Hardware Hybrid
• Software directs the hardware what to do
  – Mechanisms very similar to Jetty and RegionScout
• The paper incorrectly states that:
  – Jetty does not work for CMPs; in fact, it merely does not work well for small-scale CMPs
  – RegionScout works only for busses; in fact, it is interconnect-agnostic
RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy
Jason Zebchuk, Elham Safi and Andreas Moshovos
Int'l Symposium on Microarchitecture, 2007
(Presented at EPFL, Jan. 2008; Aenao Group, Univ. of Toronto)
Future Caches: Just Larger?
[Diagram: CPUs with I$ and D$ on an interconnect to a 10s-100s of MB cache backed by main memory.]
1. "Big picture" management
2. Store metadata
Conventional Block-Centric Cache
• "Small" blocks optimize bandwidth and performance, especially for large L2/L3 caches
• Fine-grain view of memory: the big picture is lost
[Diagram: an L2 cache holding scattered individual blocks.]
"Big Picture" View
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior are exposed, e.g., spatial locality
• Exploit for performance/area/power
[Diagram: a coarse-grain, region-level view of memory in the L2 cache.]
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations add new structures to track coarse-grain information: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence
• Each added structure on its own is hard to justify for a commercial design
Coarse-Grain Framework
• Embed coarse-grain information in the tag array
• Support many different optimizations with less area overhead
• An adaptable optimization FRAMEWORK (e.g., Virtual Tree Coherence, Power-Efficient DRAM, Speculation)
RegionTracker Solution
• Manage blocks, but also track and manage regions
[Diagram: RegionTracker replaces the L2 tag array; the L1s send block requests, the data array returns data blocks, and RegionTracker answers region probes with region responses.]
RegionTracker Summary
• Replaces the conventional tag array
  – 4-core CMP with an 8MB shared L2 cache
  – Within 1% of original performance
  – Up to 20% less tag area
  – On average 33% less energy consumption
• Optimization framework:
  – Stealth Prefetching: same performance, 36% less area
  – RegionScout: 2x more snoops avoided, no area overhead
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
Goals
1. Conventional tag array functionality
  – Identify data block location and state
  – Leave the data array unchanged
2. Optimization framework functionality
  – Is region X cached?
  – Which blocks of region X are cached, and where?
  – Evict or migrate region X
  – Easy to assign properties to each region
Coarse-Grain Cache Designs: Large Block Size
• Increased bandwidth, decreased hit rates
[Diagram: tag array and data array where one large block spans region X.]
Sector Cache
• Decreased hit rates
[Diagram: one sector tag for region X with per-block data slots.]
Sector Pool Cache
• High associativity (2-4 times higher)
[Diagram: sector tags for region X sharing a pool of data entries.]
Decoupled Sector Cache
• Region information is not exposed
• Region replacement requires scanning multiple entries
[Diagram: tag array, status table, and data array, with region X spread across entries.]
Design Requirements
• Small block size (64B)
• Miss rate does not increase
• Lookup associativity does not increase
• No additional access latency (i.e., no scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"
RegionTracker: A Tag Array Replacement
• 3 SRAM arrays, combined smaller than a conventional tag array: the Region Vector Array (RVA), the Block Status Table (BST), and the Evicted Region Buffer (ERB)
[Diagram: the three arrays sitting between the L1s and the data array.]
Common Case: Hit
• Address split (bits 49..0): Region Tag [49:21] | RVA Index [20:10] | Region Offset [9:6] | Block Offset [5:0]
• A Region Vector Array (RVA) entry holds the region tag, a valid bit, and a way pointer for each of block0..block15
• Status comes from the Block Status Table (BST); the RVA way plus the BST index locate the block in the data array
• Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
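The address split for this example configuration can be sketched as a decode helper; the field widths are assumptions derived from the slide's 49/21/10/6/0 boundaries:

```python
# Sketch of the hit-path address split for the example cache
# (8MB 16-way, 64-byte blocks, 1KB regions):
#   bits [5:0]   block offset within a 64B block
#   bits [9:6]   region offset (which of 16 blocks in a 1KB region)
#   bits [20:10] RVA index (which RVA set)
#   bits [49:21] region tag
def split_address(addr):
    block_offset  = addr & 0x3F           # bits 5:0
    region_offset = (addr >> 6) & 0xF     # bits 9:6
    rva_index     = (addr >> 10) & 0x7FF  # bits 20:10
    region_tag    = addr >> 21            # bits 49:21
    return region_tag, rva_index, region_offset, block_offset
```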
Worst Case (Rare): Region Miss
• The same lookup, but no region tag matches in the RVA
• The Evicted Region Buffer (ERB) handles the displaced region: a pointer links its BST entries while its blocks are evicted
[Diagram: the RVA/BST lookup with no match; pointers redirect through the ERB.]
Methodology
• Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
• 4-core CMP modeled after Piranha
  – Private 32KB, 4-way set-associative L1 caches
  – Shared 8MB, 16-way set-associative L2 cache
  – 64-byte blocks
• Miss rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  – WEB: SpecWEB on Apache and Zeus
  – OLTP: TPC-C on DB2 and Oracle
  – DSS: 5 TPC-H queries on DB2
[Diagram: four cores with private I$ and D$ sharing an L2 over the interconnect.]
Miss-Rates vs. Area
• Sector Cache: 512KB sectors; SPC and RT: 1KB regions
• Trade-offs comparable to a conventional cache
[Chart: relative miss rate (0.99-1.05) vs. relative tag array area (0.5-1.2), lower-left is better. RegionTracker at 14-15 ways sits near the conventional tags; the Sector Pool Cache needs 48-52 ways; the Sector Cache lands at (0.25, 1.26).]
Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Performance within 1%, with 33% tag energy reduction
• Error bars: 95% confidence intervals
[Charts: normalized execution time (0.97-1.03, lower is better) and reduction in tag energy (0%-50%, higher is better) for WEB, OLTP, and DSS.]
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
RegionTracker: An Optimization Framework
[Diagram: the RVA, BST, and ERB between the L1s and the data array.]
• Stealth Prefetching: average 20% performance improvement; dropping in RegionTracker gives 36% less area overhead
• RegionScout: in-depth analysis follows
Snoop Coherence: Common Case
[Diagram: a CPU reads x, x+1, x+2, ... x+n; every read is broadcast and misses in all remote caches.]
Many snoops are to non-shared regions.
RegionScout
• Eliminates broadcasts for non-shared regions
[Diagram: a read of x misses in all remote nodes (a global region miss); the requester records the region as non-shared, and later requests to it skip the broadcast.]
RegionTracker Implementation
• Minimal overhead to support the RegionScout optimization
  – Non-shared regions: add 1 bit to each RVA entry
  – Locally cached regions: already provided by the RVA
• Still uses less area than a conventional tag array
RegionTracker + RegionScout
[Chart: reduction in snoop broadcasts (0%-50%, harmonic mean, higher is better) for standalone RegionScout filters of 7KB, 12KB, and 22KB versus RegionScout-on-RegionTracker (RSRT); 4 processors, 512KB L2 caches, 1KB regions.]
Avoids 41% of snoop broadcasts, with no area overhead compared to a conventional tag array.
Result Summary
• Replacing the conventional tag array:
  – 20% less tag area
  – 33% less tag energy
  – Within 1% of original performance
• Coarse-grain optimization framework:
  – 36% reduction in area overhead for Stealth Prefetching
  – Filters 41% of snoop broadcasts with no area overhead compared to a conventional cache