Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group @ Toronto
Kaveh Aasaraai
Ioana Burcea
Myrto Papadopoulou
Elham Safi
Jason Zebchuk
Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
EPFL, Jan. 2008, Aenao Group/Toronto
Future Caches: Just Larger?
[Figure: three CPUs, each with I$ and D$, an interconnect, and main memory; cache capacity labeled 10s – 100s of MB]
1. “Big Picture” Management
2. Store Metadata
Conventional Block Centric Cache
“Small” blocks optimize bandwidth and performance
Large L2/L3 caches especially
Fine-Grain View of Memory
L2 Cache
Big Picture Lost
“Big Picture” View
Region: 2^n-sized, aligned area of memory
Patterns and behavior exposed
Spatial locality
Exploit for performance/area/power
Coarse-Grain View of Memory
L2 Cache
Exploiting Coarse-Grain Patterns
Many existing coarse-grain optimizations
Add new structures to track coarse-grain information
[Figure: CPU and L2 Cache, with the following optimizations around them]
  Stealth Prefetching
  Run-time Adaptive Cache Hierarchy Management via Reference Analysis
  Destination-Set Prediction
  Spatial Memory Streaming
  Coarse-Grain Coherence Tracking
  RegionScout
  Circuit-Switched Coherence
Hard to justify for a commercial design
Coarse-Grain Framework
Embed coarse-grain information in tag array
Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: L2 Cache with Tag Array, RegionTracker, and Data Array; four L1s exchange Block Requests and Data Blocks with the L2; RegionTracker handles Region Probes and Region Responses]
RegionTracker Summary
Replace conventional tag array:
  4-core CMP with 8MB shared L2 cache
  Within 1% of original performance
  Up to 20% less tag area
  Average 33% less energy consumption
Optimization Framework:
  Stealth Prefetching: same performance, 36% less area
  RegionScout: 2x more snoops avoided, no area overhead
Road Map
Introduction
Goals
Coarse-Grain Cache Designs
RegionTracker: A Tag Array Replacement
RegionTracker: An Optimization Framework
Conclusion
Goals
1. Conventional Tag Array Functionality
  Identify data block location and state
  Leave data array unchanged
2. Optimization Framework Functionality
  Is Region X cached?
  Which blocks of Region X are cached? Where?
  Evict or migrate Region X
  Easy to assign properties to each Region
Coarse-Grain Cache Designs
Large Block Size: increased BW, decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
Sector Cache
Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
Sector Pool Cache
High associativity (2 to 4 times)
[Figure: Region X in the Tag Array and Data Array]
Decoupled Sector Cache
Region information not exposed
Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, Status Table, and Data Array]
Design Requirements
Small block size (64B)
Miss-rate does not increase
Lookup associativity does not increase
No additional access latency (i.e., no scanning, no multiple block evictions)
Does not increase latency, area, or energy
Allows banking and interleaving
Fit in conventional tag array “envelope”
RegionTracker: A Tag Array Replacement
3 SRAM arrays, combined smaller than tag array:
  Region Vector Array
  Block Status Table
  Evicted Region Buffer
[Figure: four L1s and the L2 Data Array, with the three arrays listed above]
Basic Structures
Region Vector Array (RVA) entry: Region Tag, then block0 … block15, each with V (1 bit) and way (4 bits)
Block Status Table (BST) entry: status (3 bits) and a 2-bit field
Address: specific RVA set and BST set
RVA entry: multiple, consecutive BST sets
BST entry: one of four RVA sets
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
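The RVA entry described on this slide can be modeled in a few lines to show how it answers the region-level questions listed under Goals. This is only a sketch under stated assumptions: the class name `RVAEntry`, its methods, and the use of `None` to stand for V = 0 are illustrative choices, not the authors' design.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64        # from the slide's example
REGION_BYTES = 1024     # from the slide's example
WAYS = 16               # from the slide's example
BLOCKS_PER_REGION = REGION_BYTES // BLOCK_BYTES   # 16, i.e. block0 .. block15
WAY_BITS = (WAYS - 1).bit_length()                # 4 bits name one of 16 ways


@dataclass
class RVAEntry:
    """Hypothetical model of one Region Vector Array entry."""
    region_tag: int
    # One slot per block of the region: None models V = 0,
    # an integer models V = 1 plus the way that holds the block.
    ways: list = field(default_factory=lambda: [None] * BLOCKS_PER_REGION)

    def is_cached(self) -> bool:
        """Is any block of this region cached?"""
        return any(w is not None for w in self.ways)

    def cached_blocks(self) -> dict:
        """Which blocks of the region are cached, and in which way?"""
        return {i: w for i, w in enumerate(self.ways) if w is not None}


entry = RVAEntry(region_tag=0x2A)
entry.ways[0] = 3     # block0 lives in way 3
entry.ways[15] = 9    # block15 lives in way 9
print(entry.is_cached(), entry.cached_blocks())
```

A single entry lookup is enough to answer "Is Region X cached?" and "Which blocks? Where?", which is the property the earlier designs lacked when they required scanning multiple entries.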
Common Case: Hit
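Address (bit boundaries 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
Data Array + BST Index (bit boundaries 19, 6, 0): index | Block Offset
[Figure: hit path through the Region Vector Array (RVA) and Block Status Table (BST); the RVA entry with a matching Region Tag supplies V and way for the requested block (block0 … block15), which go to the Data Array; the BST supplies the block status]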
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
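The hit path can be sketched as an address split followed by an RVA search. The field widths below are derived from the slide's example (64-byte blocks give 6 offset bits, 16 blocks per 1KB region give 4 bits, and 8MB / (16 ways × 64B) gives 8192 BST sets, or 13 bits). The 11-bit RVA index is an assumption read off the 21 and 10 boundaries in the address diagram. The function names and the dictionary used to stand in for the RVA are hypothetical.

```python
from collections import namedtuple

Fields = namedtuple(
    "Fields", "region_tag rva_index region_offset block_offset bst_index")

BLOCK_OFFSET_BITS = 6    # 64-byte blocks
REGION_OFFSET_BITS = 4   # 1KB region / 64B block = 16 blocks
RVA_INDEX_BITS = 11      # assumption: taken from the 21/10 bit boundaries
BST_INDEX_BITS = 13      # 8MB / (16 ways * 64B) = 8192 sets


def split_address(addr):
    """Split a physical address into the fields named on the slide."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    region_offset = (addr >> BLOCK_OFFSET_BITS) & ((1 << REGION_OFFSET_BITS) - 1)
    rva_shift = BLOCK_OFFSET_BITS + REGION_OFFSET_BITS
    rva_index = (addr >> rva_shift) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> (rva_shift + RVA_INDEX_BITS)
    # The BST / data array index overlaps the region offset and the low
    # bits of the RVA index.
    bst_index = (addr >> BLOCK_OFFSET_BITS) & ((1 << BST_INDEX_BITS) - 1)
    return Fields(region_tag, rva_index, region_offset, block_offset, bst_index)


def lookup(rva, addr):
    """Return (data array set, way) on a hit, or None on a miss.

    `rva` maps an RVA set index to a list of (region_tag, ways) pairs,
    where ways[i] is None (V = 0) or the way holding block i.
    """
    f = split_address(addr)
    for region_tag, ways in rva.get(f.rva_index, []):
        if region_tag == f.region_tag:
            way = ways[f.region_offset]
            return None if way is None else (f.bst_index, way)
    return None  # no entry matches the region tag


ways = [None] * 16
ways[7] = 12
rva = {3: [(5, ways)]}
addr = (5 << 21) | (3 << 10) | (7 << 6) | 9
print(lookup(rva, addr))
```

With 13 BST index bits and 11 RVA index bits, two RVA index bits fall outside the BST index, which is consistent with the slide's note that a BST entry maps to one of four RVA sets.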
Worst Case (Rare): Region Miss
Address (bit boundaries 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
Data Array + BST Index (bit boundaries 19, 6, 0): index | Block Offset
[Figure: region miss path. No RVA entry matches the Region Tag (“No Match!”); the BST entry holds status (3 bits) and Ptr (2 bits); the Evicted Region Buffer (ERB) holds a Ptr]
Methodology
Flexus simulator from CMU SimFlex group
  Based on Simics full-system simulator
4-core CMP modeled after Piranha
  Private 32KB, 4-way set-associative L1 caches
  Shared 8MB, 16-way set-associative L2 cache
  64-byte blocks
Miss-rates: Functional simulation of 2 billion instructions per core
Performance and Energy: Timing simulation using SMARTS sampling methodology
Area and Power: Full custom implementation on 130nm commercial technology
9 commercial workloads:
  WEB: SpecWEB on Apache and Zeus
  OLTP: TPC-C on DB2 and Oracle
  DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with D$ and I$, connected through an interconnect to the shared L2]
Miss-Rates vs. Area
Sector Cache: 512KB sectors, SPC and RT: 1KB regions
Trade-offs comparable to conventional cache
[Plot: Relative Miss-Rate (0.99 to 1.05) vs. Relative Tag Array Area (0.5 to 1.2); series: Sector Pool Cache, RegionTracker, Conventional Tags; labeled points: 14-way, 15-way, 52-way, 48-way; Sector Cache off the scale at (0.25, 1.26)]
Performance & Energy
12-way set-associative RegionTracker: 20% less area
Error bars: 95% confidence interval
Performance within 1%, with 33% tag energy reduction
[Plots: Performance, as Normalized Execution Time (0.97 to 1.03) for WEB, OLTP, DSS; Energy, as Reduction in Tag Energy (0% to 50%) for WEB, OLTP, DSS]
RegionTracker: An Optimization Framework
[Figure: four L1s and an L2 built from the RVA, ERB, BST, and Data Array]
Stealth Prefetching: average 20% performance improvement
  Drop-in RegionTracker for 36% less area overhead
RegionScout: in-depth analysis
Snoop Coherence: Common Case
[Figure: three CPUs and main memory; one CPU issues Read x, Read x+1, Read x+2, … Read x+n, and the snoops miss in the other CPUs]
Many snoops are to non-shared regions
RegionScout
Eliminate broadcasts for non-shared regions
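[Figure: three CPUs and main memory; each CPU tracks Non-Shared Regions and Locally Cached Regions. A Read x that misses is broadcast, the other CPUs answer Region Miss, and the result is a Global Region Miss; a later Read x to the same region is shown without the broadcast]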
RegionTracker Implementation
Minimal overhead to support RegionScout optimization
Still uses less area than conventional tag array
Non-Shared Regions
Add 1 bit to each RVA entry
Locally Cached Regions
Already provided by RVA
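The filtering decision can be sketched as follows. This is a simplified model, not the authors' implementation: regions are kept in Python sets instead of RVA entries, evictions are not modeled, and the names `RegionScoutNode`, `needs_broadcast`, `snoop`, and `read_miss` are hypothetical.

```python
class RegionScoutNode:
    """Hypothetical per-processor model of the RegionScout state."""

    def __init__(self, region_bytes=1024):
        self.region_bytes = region_bytes
        self.cached_regions = set()   # "Locally Cached Regions" (RVA contents)
        self.non_shared = set()       # regions whose extra non-shared bit is set

    def region_of(self, addr):
        return addr // self.region_bytes

    def needs_broadcast(self, addr):
        # A region known to be non-shared needs no snoop broadcast.
        return self.region_of(addr) not in self.non_shared

    def snoop(self, addr):
        """Answer another node's broadcast: True if the region is cached here."""
        region = self.region_of(addr)
        # Another node is touching the region, so it can no longer be
        # treated as non-shared by this node.
        self.non_shared.discard(region)
        return region in self.cached_regions


def read_miss(requester, others, addr):
    """Handle a read miss; return True if a broadcast was sent."""
    region = requester.region_of(addr)
    broadcast = requester.needs_broadcast(addr)
    if broadcast:
        replies = [node.snoop(addr) for node in others]  # snoop every node
        if not any(replies):
            # Global region miss: nobody else caches any block of the region.
            requester.non_shared.add(region)
    requester.cached_regions.add(region)
    return broadcast


a, b, c = RegionScoutNode(), RegionScoutNode(), RegionScoutNode()
first = read_miss(a, [b, c], 0x1000)    # broadcast, global region miss
second = read_miss(a, [b, c], 0x1040)   # same 1KB region: broadcast avoided
print(first, second)
```

Once another processor requests a block of the same region, its broadcast clears the non-shared mark and later misses broadcast again, which keeps the filter safe in this model.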
RegionTracker + RegionScout
[Plot: Reduction in Snoop Broadcasts (0% to 60%) for RS 7KB, RS 12KB, RS 22KB, and RSRT; BlockScout (4KB) also marked]
4 processors, 512KB L2 caches, 1KB regions
Avoid 41% of snoop broadcasts, with no area overhead compared to conventional tag array
Result Summary
Replace Conventional Tag Array:
  20% less tag area
  33% less tag energy
  Within 1% of original performance
Coarse-Grain Optimization Framework:
  36% reduction in area overhead for Stealth Prefetching
  Filter 41% of snoop broadcasts with no area overhead compared to conventional cache
Predictor Virtualization
Ioana Burcea
Joint work with
Stephen Somogyi
Babak Falsafi
Predictor Virtualization
[Figure: CMP with several CPUs, each with L1-D and L1-I, connected through an interconnect to the L2 and main memory]
Optimization Engines: Predictors
Motivating Trends
Dedicating resources to predictors hard to justify:
  Chip multiprocessors
    Space dedicated to predictors × #processors
  Larger predictor tables
    Increased performance
Memory hierarchies offer the opportunity
  Increased capacity
  How many apps really use the space?
Use conventional memory hierarchies to store predictor information
PV Architecture contd.
[Figure: the Optimization Engine sends a request to the Predictor Table and receives a prediction]
PV Architecture contd.
[Figure: the Optimization Engine sends a request to Predictor Virtualization and receives a prediction]
PV Architecture contd.
[Figure: the Optimization Engine sends a request to the PVProxy and receives a prediction; the PVProxy holds the PVCache and an MSHR, and adds the index to PVStart to address the PVTable, which is held in the L2 and main memory]
On the backside of the L1
To Virtualize Or Not to Virtualize?
1. Re-Use
2. Predictor Info Prefetching
[Figure: CPU with I$ and D$, interconnect, L2/L3, and main memory, with paths labeled Common Case and Infrequent]
To Virtualize or Not?
Challenge: hit in the PVCache most of the time
Will not work for all predictors out of the box
Reuse is necessary
  Intrinsic: easy to virtualize
  Non-intrinsic: must be engineered
More so if the predictor needs to be fast to start with
Will There Be Reuse?
Intrinsic: multiple predictions per entry
  We’ll see an example
Can be engineered
  Group temporally correlated entries together: cache block
[Figure: CPU with I$ and D$, interconnect, L2/L3, and main memory]
Spatial Memory Streaming
Footprint: Blocks accessed per memory region
Predict next time the footprint will be the same
Handle: PC + offset within region
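A footprint predictor of this kind can be sketched as a table keyed by the handle. This is a minimal illustration, not the SMS hardware: the class and method names are hypothetical, the region size of 32 blocks of 64 bytes is borrowed from the later packing slide, and the end of a generation is signaled by an explicit call rather than by an eviction.

```python
BLOCK_BYTES = 64
BLOCKS_PER_REGION = 32   # assumption: 32 blocks per spatial group


def region_and_offset(addr):
    """Return (region number, block offset within the region)."""
    block = addr // BLOCK_BYTES
    return block // BLOCKS_PER_REGION, block % BLOCKS_PER_REGION


class FootprintPredictor:
    """Hypothetical sketch: learn footprints, replay them per handle."""

    def __init__(self):
        self.active = {}     # region -> [handle, footprint bits being recorded]
        self.patterns = {}   # handle (pc, offset) -> last recorded footprint

    def access(self, pc, addr):
        """Record an access; return block offsets to prefetch (may be empty)."""
        region, offset = region_and_offset(addr)
        if region not in self.active:
            # First (trigger) access to the region: form the handle and
            # predict that the footprint will be the same as last time.
            handle = (pc, offset)
            self.active[region] = [handle, 1 << offset]
            predicted = self.patterns.get(handle, 0)
            return [b for b in range(BLOCKS_PER_REGION)
                    if (predicted >> b) & 1 and b != offset]
        self.active[region][1] |= 1 << offset
        return []

    def end_generation(self, region):
        """Store the recorded footprint under its handle."""
        handle, footprint = self.active.pop(region)
        self.patterns[handle] = footprint


p = FootprintPredictor()
p.access(0x400, 2 * 64)      # trigger access: block 2 of region 0
p.access(0x404, 5 * 64)
p.access(0x408, 9 * 64)
p.end_generation(0)
region7 = 7 * BLOCKS_PER_REGION * BLOCK_BYTES
print(p.access(0x400, region7 + 2 * 64))   # same handle, different region
```

Because the handle does not include the region address, a footprint learned in one region can be replayed in a region that has never been touched before.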
Spatial Generations
Virtualizing SMS
[Figure: a trigger access reaches the Detector and the Predictor; the Detector passes patterns to the Predictor, which issues prefetches; the label Virtualize marks the Predictor]
Virtualizing SMS
[Figure: Virtual Table with 1K sets of 11 entries; PVCache with 8 sets of 11 entries; each cache block packs (tag, pattern) entries at bit positions 0, 11, 43, 54, 85, …, with the remaining bits unused]
Packing Entries in One Cache Block
Index: PC + offset within spatial group
  PC → 16 bits
  32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  21-bit index
Pattern table: 1K sets
  10 bits to index the table → 11-bit tag
Cache block: 64 bytes
  11 entries per cache block → pattern table is 1K sets, 11-way set-associative
[Figure: cache block layout of (tag, pattern) entries at bit positions 0, 11, 43, 54, 85, …, with the remaining bits unused]
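The packing arithmetic can be checked with a small sketch: an 11-bit tag plus a 32-bit pattern is 43 bits, eleven such entries take 473 bits, and 39 bits of the 512-bit block stay unused. The tag-first layout follows the bit positions on the slide (0, 11, 43, 54, 85); the little-endian byte order and the function names `pack_block` and `unpack_block` are assumptions made for illustration.

```python
TAG_BITS = 11
PATTERN_BITS = 32
ENTRY_BITS = TAG_BITS + PATTERN_BITS          # 43
BLOCK_BYTES = 64
BLOCK_BITS = BLOCK_BYTES * 8                  # 512
ENTRIES_PER_BLOCK = BLOCK_BITS // ENTRY_BITS  # 11
UNUSED_BITS = BLOCK_BITS - ENTRIES_PER_BLOCK * ENTRY_BITS  # 39


def pack_block(entries):
    """Pack up to 11 (tag, pattern) pairs into one 64-byte block."""
    if len(entries) > ENTRIES_PER_BLOCK:
        raise ValueError("too many entries for one cache block")
    word = 0
    for i, (tag, pattern) in enumerate(entries):
        if tag >> TAG_BITS or pattern >> PATTERN_BITS:
            raise ValueError("tag or pattern does not fit its field")
        entry = tag | (pattern << TAG_BITS)   # tag in bits 0..10, pattern above
        word |= entry << (i * ENTRY_BITS)
    return word.to_bytes(BLOCK_BYTES, "little")   # byte order is an assumption


def unpack_block(block, count=ENTRIES_PER_BLOCK):
    """Recover the first `count` (tag, pattern) pairs from a block."""
    word = int.from_bytes(block, "little")
    entries = []
    for i in range(count):
        entry = (word >> (i * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries


entries = [(i, 0xDEAD0000 + i) for i in range(ENTRIES_PER_BLOCK)]
block = pack_block(entries)
print(len(block), ENTRIES_PER_BLOCK, UNUSED_BITS)
```

Reading one cache block therefore brings in a whole 11-way set of the pattern table, which is how the design gets multiple predictions out of each memory access.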
Memory Address Calculation
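[Figure: PC (16 bits) and block offset (5 bits) form the index; 10 bits of the index, followed by six zero bits (000000), are added to the PV Start Address to give the Memory Address]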
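The address calculation can be sketched as below. The slide gives the field widths but not which bits are used, so three choices here are assumptions: the low 16 bits of the PC are taken, the PC is concatenated above the block offset, and the low 10 bits of the resulting 21-bit index select the set while the upper 11 bits serve as the tag. The function name `pv_lookup_address` is hypothetical.

```python
PC_BITS = 16
OFFSET_BITS = 5
SET_BITS = 10
BLOCK_SHIFT = 6   # the appended 000000: one 64-byte cache block per set


def pv_lookup_address(pv_start, pc, block_offset):
    """Return (memory address of the set's cache block, 11-bit tag)."""
    # Assumption: low 16 PC bits, concatenated above the 5-bit offset.
    index = ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) \
        | (block_offset & ((1 << OFFSET_BITS) - 1))
    set_index = index & ((1 << SET_BITS) - 1)   # assumption: low 10 bits
    tag = index >> SET_BITS                     # remaining 11 bits
    return pv_start + (set_index << BLOCK_SHIFT), tag


address, tag = pv_lookup_address(0x80000000, pc=0x1234, block_offset=3)
print(hex(address), hex(tag))
```

Shifting the set index left by six bits keeps every lookup aligned to a cache block, so one ordinary memory request fetches a complete set of the virtualized table.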
Simulation Infrastructure
SimFlex: CMU Impetus
  Full-system simulator based on Simics
Base processor configuration
  8-wide OoO
  256-entry ROB / 64-entry LSQ
  L1D/L1I 64KB 4-way set-associative
  UL2 8MB 16-way set-associative
Commercial workloads
  TPC-C: DB2 and Oracle
  TPC-H: Query 1, Query 2, Query 16, Query 17
  Web: Apache and Zeus
SMS – Performance Potential
[Plot: Percentage L1 Read Misses (%) (0 to 140), split into Covered, Uncovered, and Overpredictions, for Apache, Oracle, and Qry 17; predictor configurations: Infinite, 1K-16a, 1K-11a, 512-11a, 256-11a, 128-11a, 64-11a, 32-11a, 16-11a, 8-11a]
Virtualized Spatial Memory Streaming
[Plot: Percentage Speedup (-10 to 80) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17; series: SMS - 1K sets, SMS - 8 sets, SMS - PVCache 8 sets]
Original Prefetcher: Cost: 60KB
Virtualized Prefetcher: Cost: <1Kbyte
Nearly Identical Performance
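Figures of this magnitude follow from the entry format on the packing slide. The sketch below counts only the 11-bit tag and 32-bit pattern of each entry in an 11-way table; whether the slide's 60KB and <1Kbyte include any other state is not stated, so this is a plausibility check rather than the authors' accounting.

```python
ENTRY_BITS = 11 + 32   # tag + pattern, from the packing slide
WAYS = 11              # entries per set (one cache block per set)


def table_bytes(sets):
    """Storage for tag and pattern bits only, in bytes."""
    return sets * WAYS * ENTRY_BITS // 8


dedicated = table_bytes(1024)   # 1K-set dedicated pattern table
pvcache = table_bytes(8)        # 8-set PVCache
print(dedicated, pvcache)       # 60544 473
```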
Impact of Virtualization on L2 Misses
[Plot: Percentage Increase L2 Misses (0 to 2.5) for Apache, Oracle, and Qry 17; series: PV-8, PV-16, PV-32]
Impact of Virtualization on L2 Requests
[Plot: Percentage Increase L2 Requests (0 to 50) for Apache, Oracle, and Qry 17; series: PV-8, PV-16, PV-32]
Coarse-Grain Tracking
Jason Zebchuk