Upload
ossie
View
45
Download
0
Tags:
Embed Size (px)
DESCRIPTION
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy. Jason Zebchuk , Elham Safi, and Andreas Moshovos { zebchuk ,elham,moshovos}@eecg.toronto.edu AENAO Research Group Department of Electrical and Computer Engineering University of Toronto. - PowerPoint PPT Presentation
Citation preview
A Framework for Coarse-Grain Optimizations in the On-Chip
Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos{zebchuk,elham,moshovos}@eecg.toronto.edu
AENAO Research GroupDepartment of Electrical and Computer Engineering
University of Toronto
Jason Zebchuk 2 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Conventional Block Centric Cache
“Small” Blocks Optimizes Bandwidth and Performance
Large L2/L3 caches especially
Fine-Grain View of Memory
L2 Cache
Big Picture Lost
Jason Zebchuk 3 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
“Big Picture” View
Region: 2n sized, aligned area of memory Patterns and behavior exposed
Spatial locality
Exploit for performance/area/power
Coarse-Grain View of Memory
L2 Cache
Jason Zebchuk 4 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Exploiting Coarse-Grain Patterns
Many existing coarse-grain optimizations Add new structures to track coarse-grain information
CPU
L2 Cache
Stealth Prefetching
Flexible Snooping
Destination-Set Prediction
Spatial Memory Streaming
Coarse-Grain Coherence Tracking
RegionScout
Circuit-Switched
Coherence
Hard to justify for a commercial design
Coarse-Grain Framework
Jason Zebchuk 5 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Exploiting Coarse-Grain Patterns
CPU
L2 Cache
Coarse-Grain Framework
Embed coarse-grain information in tag array
Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
Jason Zebchuk 6 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
L2 Cache
RegionTracker Solution
Manage blocks, but also track and manage regions
Tag Array
L1
L1
L1
L1
Data Array
Data Blocks
BlockRequests
Block Requests
RegionTracker
RegionProbes
RegionResponses
Jason Zebchuk 7 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionTracker Summary
Replace conventional tag array: 4-core CMP with 8MB shared L2 cache Within 1% of original performance Up to 20% less tag area Average 33% less energy consumption
Optimization Framework: Stealth Prefetching: same performance, 36% less area RegionScout: 2x more snoops avoided, no area overhead
Jason Zebchuk 8 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Road Map
Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion
Jason Zebchuk 9 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Goals
1. Conventional Tag Array Functionality Identify data block location and state Leave data array un-changed
2. Optimization Framework Functionality Is Region X cached? Which blocks of Region X are cached? Where? Evict or migrate Region X
Jason Zebchuk 10 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Coarse-Grain Cache Designs
Increased BW, Decreased hit-rates
Region X
Large Block Size
Tag Array Data Array
Jason Zebchuk 11 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Sector Cache
Decreased hit-rates
Region X
Tag Array
Data Array
Jason Zebchuk 12 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Sector Pool Cache
High Associativity (2 - 4 times)
Region X
Tag Array Data Array
Jason Zebchuk 13 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Decoupled Sector Cache
Region information not exposed Region replacement requires scanning multiple entries
Region X
Tag Array Data ArrayStatus Table
Jason Zebchuk 14 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Design Requirements
Small block size (64B) Miss-rate does not increase Lookup associativity does not increase No additional access latency
(i.e., No scanning, no multiple block evictions)
Does not increase latency, area, or energy Allows banking and interleaving
Fit in conventional tag array “envelope”
Jason Zebchuk 15 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionTracker: A Tag Array Replacement
L1
L1
L1
L1
Data Array
3 SRAM arrays, combined smaller than tag array
RegionVectorArray
BlockStatusTable
EvictedRegionBuffer
Jason Zebchuk 16 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Common Case: Hit
Region Tag RVA Index Region OffsetBlock Offset49 061021
Address:
Region Vector Array(RVA)
Region Tag ……
block0
block15
wayV
Block Offset19 6 0
Block Status Table(BST)
1 4
status
3 2
Data Array + BST Index
To Data Array
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
Jason Zebchuk 17 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Worst Case (Rare): Region Miss
Region Tag RVA Index Region OffsetBlock Offset
49 061021
Address:
Region Vector Array(RVA)
Region Tag ……
block0
block15
wayV
Block Offset19 6 0
Block Status Table(BST)
status
3
Ptr
2
Data Array + BST Index
EvictedRegionBuffer(ERB)No
Match!
Ptr
Jason Zebchuk 18 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Methodology
Flexus simulator from CMU SimFlex group Based on Simics full-system simulator
4-core CMP modeled after Piranha Private 32KB, 4-way set-associative L1 caches Shared 8MB, 16-way set-associative L2 cache 64-byte blocks
Miss-rates: Functional simulation of 2 billion instructions per core Performance and Energy: Timing simulation using SMARTS sampling
methodology Area and Power: Full custom implementation on 130nm commercial
technology 9 commercial workloads:
WEB: SpecWEB on Apache and Zeus OLTP: TPC-C on DB2 and Oracle DSS: 5 TPC-H queries on DB2
Interconnect
L2
P
D$ I$
P
D$ I$
P
D$ I$
P
D$ I$
Jason Zebchuk 19 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Miss-Rates vs. Area
Sector Cache: 512KB sectors, SPC and RT: 1KB regions Trade-offs comparable to conventional cache
0.99
1
1.01
1.02
1.03
1.04
1.05
0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
Sector Pool Cache
RegionTracker
Conventional Tags
better
Rela
tive M
iss-
Rate
Relative Tag Array Area
Sector Cache (0.25, 1.26)
14-way 15-way
52-way
48-way
Jason Zebchuk 20 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Performance & Energy
0.97
0.98
0.99
1.00
1.01
1.02
1.03
WEB OLTP DSS0%
10%
20%
30%
40%
50%
WEB OLTP DSS
12-way set-associative RegionTracker: 20% less area Error bars: 95% confidence interval
Performance within 1%, with 33% tag energy reduction
Norm
aliz
ed E
xecu
tion T
ime
better
Reduct
ion in T
ag E
nerg
y
better
Performance Energy
Jason Zebchuk 21 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Road Map
Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion
Jason Zebchuk 22 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionTracker: An Optimization Framework
L1
L1
L1
L1
RVA
ERB
Data Array
BST
Stealth Prefetching:Average 20% performance improvement
Drop-in RegionTracker for 36% less area overhead
RegionScout:In-depth analysis
Jason Zebchuk 23 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Snoop Coherence: Common Case
Main Memory
CPU CPU CPU
Read x
mis
sm
iss
Read x+1Read x+n
Many snoops are to non-shared regions
Jason Zebchuk 24 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionScout
Eliminate broadcasts for non-shared regions
Main Memory
CPUCPU CPU
Global Region Miss
Region Miss
Non-Shared Regions Locally Cached Regions
Read xRead x
RegionMiss
MissMiss
Jason Zebchuk 25 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionTracker Implementation
Minimal overhead to support RegionScout optimization
Still uses less area than conventional tag array
Non-Shared Regions
Add 1 bit to each RVA entry
Locally Cached Regions
Already provided by RVA
Jason Zebchuk 26 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
RegionTracker + RegionScout
0%
10%
20%
30%
40%
50%
60%
RS 7KB RS 12KB RS 22KB RSRT
Reduct
ion in
Snoop B
roadca
sts
better
4 processors, 512KB L2 Caches 1KB regions
Avoid 41% of Snoop Broadcasts,no area overhead compared to conventional tag
array
BlockScout(4KB)
New optimization possible with
RegionTracker
Jason Zebchuk 27 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Result Summary
Replace Conventional Tag Array: 20% Less tag area 33% Less tag energy Within 1% of original performance
Coarse-Grain Optimization Framework: 36% reduction in area overhead for Stealth Prefetching Filter 41% of snoop broadcasts with no area overhead
compared to conventional cache
Jason Zebchuk 28 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Exploiting Coarse-Grain Patterns
CPU
L2 Cache
Stealth Prefetching
Run-time Adaptive Cache Hierarchy Management via
Reference Analysis
Destination-Set Prediction
Spatial Memory Streaming
Coarse-Grain Coherence Tracking
RegionScout
Circuit-Switched
Coherence
Conclusion
RegionTracker framework makes coarse-grainoptimizations more attractive
CPU
L2 Cache
A Framework for Coarse-Grain Optimizations in the On-Chip
Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos{zebchuk,elham,moshovos}@eecg.toronto.edu
AENAO Research GroupDepartment of Electrical and Computer Engineering
University of Toronto