Snoop Filtering and Coarse-Grain Memory Tracking
Andreas Moshovos, Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors
JETTY: Snoop-Filtering for Reduced Power in SMP Servers
Andreas Moshovos, ECE, Univ. of Toronto
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int'l Symposium on High-Performance Computer Architecture (HPCA), 2001
Power is Becoming Important
• Architecture is a science of tradeoffs
• Thus far: Performance vs. Cost vs. Complexity
• Today: ... vs. Power
• Where? Mobile devices; desktops/servers (our focus)
Power-Aware Servers
• Revisit the design of SMP servers
  – 2 or more CPUs per machine
  – Snoop-coherence based
• Why? File, web, databases: your typical workloads; cost-effective too
• This work, a first step: power-aware snoopy coherence
Power-Aware Snoop Coherence
• Conventional
  – All L2 caches snoop all memory traffic
  – Power is expended by all caches on any memory access
• Jetty-Enhanced
  – Tiny structure on the L2 backside
  – Filters most "would-be misses"
  – Less power expended on most snoop misses
  – No changes to the protocol necessary
  – No performance loss
Roadmap
• Why Power is a Concern for Servers
• Snoopy-Coherence Basics
• An Opportunity for Reducing Power
• JETTY
• Results
• Summary
Why is Power Important?
Power could ultimately limit performance.
• Power demands have been increasing
• Energy must be delivered to and on chip
• Heat must be dissipated
• Limits: amount of resources & frequency; feasibility
• Cooling is a solution, but at what cost and integration effort?
Reducing power demands is much more convenient.
What can be done?
• Redesign circuits
• Clock gating and frequency scaling
  – A lot has been done thus far; still an active area
• Rethink architectural decisions
  – Orthogonal to the others
Goal: reduce power under performance constraints

The "Silver Bullet" Solution
• Good if there was one; however, until one is found...
• Look at all structures
• Rethink their design
• Propose power-optimized versions
• This is what we already do for performance
Snoopy Cache Coherence
• All L2 tags see all bus accesses; a cache intervenes when necessary
[Diagram: CPU cores with private L1 and L2 caches on a shared bus to main memory; a remote L2 snoop hits.]
How About Power?
• All L2 tags see all bus accesses
• Performance & complexity: we have the L2 tags, why not use them?
• Power: all L2 tags consume power on all accesses
[Diagram: the same system; every remote L2 performs a tag lookup and misses.]
JETTY: A Would-Be Snoop-Miss Filter
• Imprecise: it may filter a would-be miss, but it never filters snoop hits
• Would-be snoop miss: JETTY answers "Not here!" and the L2 tag lookup is skipped
• Would-be snoop hit: JETTY answers "Don't know" and the snoop proceeds to the L2 tags
• Goal: detect most misses using fewer resources
Potential for Savings Exists
• Most snoops miss: 91% on average
• Many L2 accesses are due to snoop misses: 55% on average
• Sizeable potential power savings: 20%-50% of total L2 power
Exclude-Jetty
• Records a subset of what is not cached
• How? Cache recent snoop misses locally
[Diagram: Venn view of cached vs. not-cached blocks; the Exclude-Jetty covers part of the not-cached set.]
• In short: a subset of what you don't have
• Works well for producer-consumer sharing
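The idea above can be sketched as a small direct-mapped table of recently observed snoop misses; the 64-entry geometry and the interface names here are illustrative assumptions, not the paper's design:

```python
# Sketch of an Exclude-Jetty: a small direct-mapped table caching block
# addresses that recently missed in the local L2. (Geometry is illustrative.)
ENTRIES = 64

class ExcludeJetty:
    def __init__(self):
        self.tags = [None] * ENTRIES  # recently-missed block addresses

    def filter(self, addr):
        """True -> snoop is a known miss; the L2 tag lookup can be skipped."""
        return self.tags[addr % ENTRIES] == addr

    def record_miss(self, addr):
        """Called when a snoop actually missed in the local L2."""
        self.tags[addr % ENTRIES] = addr

    def invalidate(self, addr):
        """Called when the local cache allocates addr (it is now cached)."""
        idx = addr % ENTRIES
        if self.tags[idx] == addr:
            self.tags[idx] = None
```

Because the table only ever asserts "known miss", a stale or missing entry merely forfeits a filtering opportunity; correctness is preserved.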
Include-Jetty
• Records a superset of what is cached
• How? Hash the address with functions f( ), g( ), h( ); each selects one bit in its own bit vector (bit vector 0, 1, 2)
  – Not cached: any selected bit is zero
  – May be cached: all selected bits are set
[Diagram: an address feeding f( ), g( ), h( ), each indexing a separate bit vector.]
Later I was told this is a Bloom filter…
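The structure above can be sketched as follows; the vector size and the three hash mixes are illustrative assumptions, not the paper's functions:

```python
# Sketch of an Include-Jetty as a Bloom filter: three hash functions each
# set one bit in their own bit vector on every L2 block allocation.
# A snoop whose address finds any zero bit cannot be cached -> filter it.
BITS = 128

def f(a): return a % BITS
def g(a): return (a // BITS) % BITS
def h(a): return (a * 2654435761 >> 8) % BITS  # Knuth-style mix, illustrative

class IncludeJetty:
    def __init__(self):
        self.vec = [[0] * BITS for _ in range(3)]

    def allocate(self, addr):
        for v, fn in zip(self.vec, (f, g, h)):
            v[fn(addr)] = 1

    def may_be_cached(self, addr):
        # All three selected bits set -> "may be cached" (must snoop).
        # Any zero bit -> definitely not cached (safe to filter).
        return all(v[fn(addr)] for v, fn in zip(self.vec, (f, g, h)))
```

Note this sketch omits deletion: plain bit vectors cannot be decremented on eviction, which is exactly why the real design moves to counting (see the L-CBF note below).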
Include-Jetty (cont.)
• A superset of what you have
• This is a counting Bloom filter; see: L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, in Proc. Int'l Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006
• Partially overlapping indexes worked better
Hybrid-Jetty
• In some cases Exclude-Jetty works well; in others Include-Jetty is better
• Combine them:
  – Access both in parallel on a snoop
  – Allocation: into IJ always; if IJ fails to filter, then also into EJ
  – EJ coverage increases
Latency?
• Jetty may increase snoop-response time
• Can only be determined on a design-by-design basis
• Largest Jetty: five 32x32-bit register files
Results
• Used SPLASH-2
  – Scientific applications, "large" datasets
    • e.g., 4-80 MB of main memory allocated; access counts: 60M-1.7B
  – 4-way SMP, MOESI
  – 1MB direct-mapped L2, 64B blocks with 32B subblocks
  – 32KB direct-mapped L1, 32B blocks
• Coverage & power (analytical model)
Coverage: Hybrid-Jetty
• Can capture 74% of all snoop misses
[Chart: snoop-miss coverage (0%-100%, higher is better) per benchmark (ba, ch, em, ff, fm, lu, oc, ra, rt, un) and on average, for three configurations: 10x4x7 + 32x4, 9x4x7 + 32x4, 8x4x7 + 32x4.]
Power Savings
• 28% of overall L2 power
[Chart: L2 power savings (0%-50%, higher is better) per benchmark (ba, ch, em, ff, fm, oc, ra, rt, un) and on average.]
Summary
• Power is becoming important
  – For performance, reliability, and feasibility
• Unique opportunities exist for servers
• JETTY: filter snoops that would miss
  – 74% of all snoops filtered
  – 28% of L2 power saved
  – No protocol changes
  – No performance loss
Power-Efficient Cache Coherence
C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001

Serial Snooping
• Avoids speculative transmission of snoop packets
• Check the nearest neighbor first
• Data supplied with minimum latency and power
TLB and Snoop Energy Reduction using Virtual Caches in Low-Power Chip-Multiprocessors
Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology, *Ericsson Mobile Platforms
Int'l Symposium on Low Power Electronics and Design, Aug. 2002

Page Sharing Tables
• On a snoop, the requesting node gets a page-level sharing vector
• If a PST entry is evicted, the whole page must be evicted
• A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
RegionScout: Exploiting Coarse-Grain Sharing in Snoop Coherence
Andreas Moshovos, [email protected]
Int'l Symposium on Computer Architecture (ISCA), 2005
[Diagram: CPUs with I$ and D$ connected through an interconnect to main memory.]
Improving Snoop Coherence
• Conventional considerations: complexity and correctness, NOT power/bandwidth
• Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence? Snooping remains attractive: simple, and enables design re-use
• Yes: exploit program behavior to dynamically identify requests that do not need snooping
RegionScout: Avoid Some Snoops
• Frequent case: non-sharing, even at a coarse level (a Region)
• RegionScout: dynamically identify non-shared regions
  – The first request to a region identifies it as not shared
  – Subsequent requests do not need to be broadcast
• Uses imprecise information
  – Small structures
  – A layer on top of conventional coherence
  – No additional constraints
Roadmap
• Conventional Coherence: the need for power-aware designs
• Potential: Program Behavior
• RegionScout: What and How
• Implementation
• Evaluation
• Summary
Coherence Basics
• Given a request for memory block X (an address), detect where its current value resides
[Diagram: all CPUs snoop the request for X; one cache hits and supplies the value.]
Conventional Coherence is not Power-Aware or Bandwidth-Effective
• All L2 tags see all accesses
• Performance & complexity: we have the L2 tags, why not use them?
• Power: all L2 tags consume power on all accesses
• Bandwidth: all coherent requests are broadcast
[Diagram: every remote L2 looks up its tags and misses.]
RegionScout Motivation: Sharing is Coarse
• Region: a large, contiguous, aligned memory area of power-of-2 size
• When CPU X asks for a data block in region R, frequently:
  1. No one else has X
  2. No one else has any block in R
• RegionScout exploits this behavior as a layered extension over snoop coherence
[Figure: a typical memory-space snapshot, addresses colored by owner(s).]
Optimization Opportunities
• Power and bandwidth
  – Originating node: avoid asking the others
  – Remote nodes: avoid the tag lookup
[Diagram: CPUs with I$ and D$ around a switch connected to memory.]
Potential: Region Miss Frequency
[Chart: global region misses as a fraction of all requests (0%-100%, higher is better) vs. region size (256B to 16KB), for 4- and 8-processor systems with 512KB or 1MB L2 caches (p4.512K, p4.1M, p8.512K, p8.1M).]
Even with a 16KB region, ~45% of requests miss in all remote nodes.
RegionScout at Work: Non-Shared Region Discovery
The first request detects a non-shared region:
1. The originating node broadcasts its request
2. Each remote node checks its record of locally cached regions and signals a region miss
3. Seeing a global region miss, the originating node adds the region to its record of non-shared regions
RegionScout at Work: Avoiding Snoops
A subsequent request avoids snoops:
1. The originating node finds the region in its record of non-shared regions
2. The request goes directly to memory: no broadcast, no remote tag lookups
RegionScout is Self-Correcting
A request from another node invalidates the non-shared record: nodes snoop all requests, and any node holding a non-shared entry for the requested region drops that entry.
Implementation: Requirements
• The requesting node provides the address; its Region is the high-order part: address = Region Tag | offset, where the offset is lg(Region Size) bits
• At the originating node (request from the CPU): have I discovered that this region is not shared?
• At remote nodes (request from the interconnect): do I have a block in the region?
Remembering Non-Shared Regions: the Non-Shared Region Table (NSRT)
• Records non-shared regions: region tag plus valid bit
• Looked up by the Region portion of the address prior to issuing a request
• Snoops other nodes' requests and invalidates matching entries
• Few entries: 16 sets x 4 ways in most experiments
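The table just described can be sketched as a tiny set-associative structure; the 16x4 geometry follows the slide, while the per-set FIFO replacement and the 16KB region size are illustrative assumptions:

```python
# Sketch of a Non-Shared Region Table (NSRT): a small set-associative
# cache of region tags this node has discovered to be non-shared.
from collections import deque

SETS, WAYS = 16, 4
REGION_BITS = 14  # assuming 16KB regions

class NSRT:
    def __init__(self):
        # deque(maxlen=WAYS) gives simple FIFO replacement per set.
        self.sets = [deque(maxlen=WAYS) for _ in range(SETS)]

    def _loc(self, addr):
        region = addr >> REGION_BITS
        return self.sets[region % SETS], region

    def is_non_shared(self, addr):
        s, region = self._loc(addr)
        return region in s

    def record_non_shared(self, addr):
        s, region = self._loc(addr)
        if region not in s:
            s.append(region)  # oldest entry falls out when the set is full

    def snoop_invalidate(self, addr):
        # Another node touched this region: it may be shared now.
        s, region = self._loc(addr)
        if region in s:
            s.remove(region)
```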
What Regions are Locally Cached?
• If we had as many counters as regions:
  – Block allocation: counter[region]++
  – Block eviction: counter[region]--
  – A region is cached only if counter[region] is non-zero
• Not practical: e.g., 16KB regions and 4GB of memory require 256K counters
What Regions are Locally Cached? The Cached Region Hash (CRH) (Moshovos ©)
• Use few counters, indexed by a hash of the Region portion of the address; imprecise:
  – Records a superset of the locally cached regions
  – False positives: a lost opportunity, but correctness is preserved
• Counter: ++ on block allocation, -- on block eviction; few entries, e.g., 256
• A p-bit per entry (1 if the counter is non-zero) is used for lookups
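The counter scheme above can be sketched directly; the 256-entry size follows the slide, while the modulo hash and 16KB region size are illustrative assumptions:

```python
# Sketch of the Cached Region Hash (CRH): a small array of counters
# indexed by a hash of the region tag.
CRH_ENTRIES = 256
REGION_BITS = 14  # assuming 16KB regions -> drop low 14 address bits

class CachedRegionHash:
    def __init__(self):
        self.count = [0] * CRH_ENTRIES

    def _idx(self, addr):
        return (addr >> REGION_BITS) % CRH_ENTRIES  # illustrative hash

    def on_block_alloc(self, addr):
        self.count[self._idx(addr)] += 1

    def on_block_evict(self, addr):
        self.count[self._idx(addr)] -= 1

    def region_may_be_cached(self, addr):
        # The p-bit: a non-zero counter means the region *may* be cached
        # here (false positives possible, never false negatives).
        return self.count[self._idx(addr)] != 0
```

Hash aliasing means two regions can share a counter, which is exactly the "superset" imprecision: a remote node may snoop needlessly, but it never skips a snoop it needed.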
Roadmap
• Conventional Coherence
• Program Behavior: Region Miss Frequency
• RegionScout
• Evaluation
• Summary
Evaluation Overview
• Methodology
• Filter rates: practical filters can capture many region misses
• Interconnect bandwidth reduction
Methodology
• In-house simulator based on Simplescalar
  – Execution-driven; all instructions simulated; MIPS-like ISA
  – System calls faked by passing them to the host OS
  – Synchronization via load-linked/store-conditional
  – Simple in-order processors; memory requests complete instantaneously
  – MESI snoop coherence; 1- or 2-level memory hierarchy
  – WATTCH power models
• SPLASH-2 benchmarks
  – Scientific workloads
  – A feasibility study
Filter Rates
[Chart: identified global region misses (0%-100%, higher is better) vs. CRH size (256 to 2K entries), for 4- and 8-processor systems with 512KB caches and 4KB or 16KB regions (p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K).]
For a small CRH it is better to use large regions. Practical RegionScout filters capture much of the potential.
Bandwidth Reduction
[Chart: reduction in messages (0%-100%, higher is better) vs. region size (2KB to 16KB), for SMP (p4.512K, p8.512K) and CMP (p4.64K, p8.64K) configurations.]
Moderate bandwidth savings for SMP (15%-22%); more so for CMP (>25%).
Related Work
• RegionScout: Technical Report, Dec. 2003
• Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
• PST: Ekman, Dahlgren, and Stenström, ISLPED 2002
• Coarse-Grain Coherence: Cantin, Lipasti and Smith, ISCA 2005
Summary
• Exploit program behavior; optimize a frequent case
  – Many requests result in a global region miss
• RegionScout
  – A practical filter mechanism
  – Dynamically detects would-be region misses
  – Avoids broadcasts; saves tag-lookup power and interconnect bandwidth
  – Small structures; a layered extension over existing mechanisms
  – Invisible to the programmer and the OS
Coarse-Grain Coherence
J. Cantin, M. Lipasti and J. E. Smith, ISCA 2005
• Exploits the same phenomenon as RegionScout
• The protocol is extended to also keep track of region state
  – Enables additional optimizations
• Uses an additional region tag array to do so
• Region replacements: must scan the cache, find the region's blocks, and evict them
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture, June 2006
Predictors and Algorithms
The (ring-specific) predictor at each node holds a set of addresses the node may be able to supply.

predictor / algorithm     action on positive prediction   action on negative prediction
Subset                    snoop                           forward then snoop
Superset Con(servative)   snoop then forward              forward
Superset Agg(ressive)     forward then snoop              forward
Exact                     snoop                           forward
Predictor Implementation
• Subset
  – Associative table: a subset of the addresses that can be supplied by the node
• Superset
  – Bloom filter: a superset of the addresses that can be supplied by the node
  – Associative table (exclude cache): addresses that recently suffered false positives
• Exact
  – Associative table: all addresses that can be supplied by the node
  – Downgrading: if an address has to be evicted from the predictor table, the corresponding line in the node has to be downgraded
Design and Implementation of the Blue Gene/P Snoop Filter
Valentina Salapura, Matthias Blumrich, Alan Gara
Int’l Conf. on High-Performance Computer Architecture, 2008
Three Mechanisms
• Stream registers
  – Track contiguous data areas
  – Adaptive: cache arbitrarily sized contiguous regions with a single register
  – Track strided and sequential streams
• Snoop caches
  – A cache of recently executed snoop requests
  – Multiple requests to the same line do not have to cause multiple snoop lookups
  – Track locality
• Range filter
  – Identifies regions of known non-shared data
  – Configured by software
Stream Registers
• Base = where the block stream starts; Mask = which address bits are common
  – Example: base 0111, mask 1101: any address matching 01X1 may be in the cache
• Over time the Mask decays toward all zeros; how to reset?
• Cache wrap
  – Each set uses round-robin replacement; count replacements per set
  – A cache wrap has occurred when all counters exceed the number of ways
  – On a wrap, copy all stream registers to a history set and use the combination
  – On the next wrap, throw out the history
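The base/mask update rule above can be sketched as follows; the tiny 4-bit address width matches the slide's example, and the class interface is an illustrative assumption:

```python
# Sketch of a Blue Gene/P-style stream register: (base, mask) summarizes a
# set of cached line addresses. A mask bit stays 1 only while every recorded
# address agrees with the base in that bit position.
ADDR_BITS = 4  # tiny addresses, as in the slide's example

class StreamRegister:
    def __init__(self, first_addr):
        self.base = first_addr
        self.mask = (1 << ADDR_BITS) - 1  # all bits significant at first

    def add(self, addr):
        # Clear mask bits wherever the new address disagrees with the base.
        self.mask &= ~(self.base ^ addr)

    def may_contain(self, addr):
        # Snoop check: addr matches the base in all still-significant bits.
        return (addr & self.mask) == (self.base & self.mask)
```

With base 0111 and a second address 0101, the mask becomes 1101, so the register matches the pattern 01X1, exactly the slide's example.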
Stream Registers: An Example
• Direct-mapped cache with two blocks, tracked by stream registers (base / mask)
[Worked timeline: as blocks 001, 011, 101, 111 are allocated and replaced, the stream registers accumulate to 001 / 1X1 and 101 / 1X1.]
• At this point the filter reports that the cache contains 001 and 011, plus 101 and 111; the first two are no longer there
• Eventually the filter becomes saturated and can no longer filter much
• How can we get rid of the stale 001 / 1X1 entry?
Avoiding Saturation: Exploiting Cache Wraps
[Worked timeline: the same allocation sequence, but on a cache wrap the current stream register (001 / 1X1) is copied to a shadow set and the active registers restart empty; subsequent allocations build 101 / 111 and then 101 / 1X1. Once the cache wraps again, the shadow (001 / 1X1) can be discarded.]
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram, Ahmad Sharif, Hsien-Hsin S. Lee
ASPLOS 2008

Software-Hardware Hybrid
• Software directs the hardware what to do
  – Mechanisms very similar to Jetty and RegionScout
• The paper incorrectly states that:
  – Jetty does not work for CMPs; in fact, it merely does not work well for small-scale CMPs
  – RegionScout works only for busses; in fact, it is interconnect-agnostic
RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy
Jason Zebchuk, Elham Safi and Andreas Moshovos
Int'l Symposium on Microarchitecture, 2007
(Presented at EPFL, Jan. 2008; Aenao Group, Univ. of Toronto)
Future Caches: Just Larger?
[Diagram: CPUs with I$ and D$ on an interconnect to a 10s-100s of MB cache backed by main memory.]
1. "Big picture" management
2. Store metadata
Conventional Block-Centric Cache
• "Small" blocks optimize bandwidth and performance, especially for large L2/L3 caches
• Fine-grain view of memory: the big picture is lost
[Diagram: an L2 cache holding scattered individual blocks.]
"Big Picture" View
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior are exposed, e.g., spatial locality
• Exploit for performance/area/power
[Diagram: a coarse-grain, region-level view of memory in the L2 cache.]
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations add new structures to track coarse-grain information: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence
• Each added structure on its own is hard to justify for a commercial design
Coarse-Grain Framework
• Embed coarse-grain information in the tag array
• Support many different optimizations with less area overhead
• An adaptable optimization FRAMEWORK (e.g., Virtual Tree Coherence, Power-Efficient DRAM, Speculation)
RegionTracker Solution
• Manage blocks, but also track and manage regions
[Diagram: RegionTracker replaces the L2 tag array; the L1s send block requests, the data array returns data blocks, and RegionTracker answers region probes with region responses.]
RegionTracker Summary
• Replaces the conventional tag array
  – 4-core CMP with an 8MB shared L2 cache
  – Within 1% of original performance
  – Up to 20% less tag area
  – On average 33% less energy consumption
• Optimization framework:
  – Stealth Prefetching: same performance, 36% less area
  – RegionScout: 2x more snoops avoided, no area overhead
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
Goals
1. Conventional tag array functionality
  – Identify data block location and state
  – Leave the data array unchanged
2. Optimization framework functionality
  – Is region X cached?
  – Which blocks of region X are cached, and where?
  – Evict or migrate region X
  – Easy to assign properties to each region
Coarse-Grain Cache Designs: Large Block Size
• Increased bandwidth, decreased hit rates
[Diagram: tag array and data array where one large block spans region X.]
Sector Cache
• Decreased hit rates
[Diagram: one sector tag for region X with per-block data slots.]
Sector Pool Cache
• High associativity (2-4 times higher)
[Diagram: sector tags for region X sharing a pool of data entries.]
Decoupled Sector Cache
• Region information is not exposed
• Region replacement requires scanning multiple entries
[Diagram: tag array, status table, and data array, with region X spread across entries.]
Design Requirements
• Small block size (64B)
• Miss rate does not increase
• Lookup associativity does not increase
• No additional access latency (i.e., no scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"
RegionTracker: A Tag Array Replacement
• 3 SRAM arrays, combined smaller than a conventional tag array: the Region Vector Array (RVA), the Block Status Table (BST), and the Evicted Region Buffer (ERB)
[Diagram: the three arrays sitting between the L1s and the data array.]
Common Case: Hit
• Address split (bits 49..0): Region Tag [49:21] | RVA Index [20:10] | Region Offset [9:6] | Block Offset [5:0]
• A Region Vector Array (RVA) entry holds the region tag, a valid bit, and a way pointer for each of block0..block15
• Status comes from the Block Status Table (BST); the RVA way plus the BST index locate the block in the data array
• Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
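The address split for this example configuration can be sketched as a decode helper; the field widths are assumptions derived from the slide's 49/21/10/6/0 boundaries:

```python
# Sketch of the hit-path address split for the example cache
# (8MB 16-way, 64-byte blocks, 1KB regions):
#   bits [5:0]   block offset within a 64B block
#   bits [9:6]   region offset (which of 16 blocks in a 1KB region)
#   bits [20:10] RVA index (which RVA set)
#   bits [49:21] region tag
def split_address(addr):
    block_offset  = addr & 0x3F           # bits 5:0
    region_offset = (addr >> 6) & 0xF     # bits 9:6
    rva_index     = (addr >> 10) & 0x7FF  # bits 20:10
    region_tag    = addr >> 21            # bits 49:21
    return region_tag, rva_index, region_offset, block_offset
```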
Worst Case (Rare): Region Miss
• The same lookup, but no region tag matches in the RVA
• The Evicted Region Buffer (ERB) handles the displaced region: a pointer links its BST entries while its blocks are evicted
[Diagram: the RVA/BST lookup with no match; pointers redirect through the ERB.]
Methodology
• Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
• 4-core CMP modeled after Piranha
  – Private 32KB, 4-way set-associative L1 caches
  – Shared 8MB, 16-way set-associative L2 cache
  – 64-byte blocks
• Miss rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  – WEB: SpecWEB on Apache and Zeus
  – OLTP: TPC-C on DB2 and Oracle
  – DSS: 5 TPC-H queries on DB2
[Diagram: four cores with private I$ and D$ sharing an L2 over the interconnect.]
Miss-Rates vs. Area
• Sector Cache: 512KB sectors; SPC and RT: 1KB regions
• Trade-offs comparable to a conventional cache
[Chart: relative miss rate (0.99-1.05) vs. relative tag array area (0.5-1.2), lower-left is better. RegionTracker at 14-15 ways sits near the conventional tags; the Sector Pool Cache needs 48-52 ways; the Sector Cache lands at (0.25, 1.26).]
Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Performance within 1%, with 33% tag energy reduction
• Error bars: 95% confidence intervals
[Charts: normalized execution time (0.97-1.03, lower is better) and reduction in tag energy (0%-50%, higher is better) for WEB, OLTP, and DSS.]
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
RegionTracker: An Optimization Framework
[Diagram: the RVA, BST, and ERB between the L1s and the data array.]
• Stealth Prefetching: average 20% performance improvement; dropping in RegionTracker gives 36% less area overhead
• RegionScout: in-depth analysis follows
Snoop Coherence: Common Case
[Diagram: a CPU reads x, x+1, x+2, ... x+n; every read is broadcast and misses in all remote caches.]
Many snoops are to non-shared regions.
RegionScout
• Eliminates broadcasts for non-shared regions
[Diagram: a read of x misses in all remote nodes (a global region miss); the requester records the region as non-shared, and later requests to it skip the broadcast.]
RegionTracker Implementation
• Minimal overhead to support the RegionScout optimization
  – Non-shared regions: add 1 bit to each RVA entry
  – Locally cached regions: already provided by the RVA
• Still uses less area than a conventional tag array
RegionTracker + RegionScout
[Chart: reduction in snoop broadcasts (0%-50%, harmonic mean, higher is better) for standalone RegionScout filters of 7KB, 12KB, and 22KB versus RegionScout-on-RegionTracker (RSRT); 4 processors, 512KB L2 caches, 1KB regions.]
Avoids 41% of snoop broadcasts, with no area overhead compared to a conventional tag array.
Result Summary
• Replacing the conventional tag array:
  – 20% less tag area
  – 33% less tag energy
  – Within 1% of original performance
• Coarse-grain optimization framework:
  – 36% reduction in area overhead for Stealth Prefetching
  – Filters 41% of snoop broadcasts with no area overhead compared to a conventional cache