University of Utah & HP Labs
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar
Rajeev Balasubramonian
Norman P. Jouppi
Large Caches
Cache hierarchies will dominate chip area
3D-stacked processors with an entire die for on-chip cache could become common
Montecito has two private 12 MB L3 caches (27 MB including L2)
Long global wires are required to transmit data/address
[Figure: Intel Montecito die photo; the two cache regions dominate the area]
Wire Delay/Power
Wire delays are costly for performance and power
Latencies of 60 cycles to reach the ends of a chip at 32 nm (@ 5 GHz)
50% of dynamic power is in interconnect switching (Magen et al., SLIP '04)
CACTI* access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology
*version 4
Contribution
Support for various interconnect models: improved design-space exploration
Support for modeling Non-Uniform Cache Access (NUCA)
Cache Design Basics
[Figure: cache read path — the input address feeds a decoder that drives a wordline; bitlines discharge into column muxes and sense amps; the tag array output passes through comparators ("valid output?") to the mux drivers, which steer the data array output through the output drivers to the data output]
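The read path above can be summarized as a sum of component delays along the critical path. The following sketch illustrates that decomposition; the stage names follow the figure, but the delay values are purely illustrative placeholders, not CACTI's transistor- and wire-level estimates:

```python
# Hedged sketch: a cache access time modeled as the sum of critical-path
# component delays, following the read path in the figure above.
def access_time(delays):
    """Sum the critical-path component delays (ns) of a cache read."""
    return sum(delays[stage] for stage in (
        "decoder", "wordline", "bitline", "sense_amp",
        "comparator", "mux_driver", "output_driver"))

# Illustrative (made-up) per-stage delays in nanoseconds:
example = {"decoder": 0.35, "wordline": 0.15, "bitline": 0.30,
           "sense_amp": 0.10, "comparator": 0.08,
           "mux_driver": 0.05, "output_driver": 0.12}
print(round(access_time(example), 2))  # 1.15
```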
Existing Model - CACTI
[Figure: cache models with 4 and 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay
Power/Delay Overhead of Wires
[Chart: H-tree delay and power percentages (0–70%) vs. cache size (2–32 MB)]
H-tree delay percentage increases with cache size
H-tree power continues to dominate
Bitlines are the other major contributor to total power
Motivation
The dominant role of interconnect is clear
Lack of a tool to model interconnect in detail can impede progress
Current solutions (Orion, CACTI) have limited wire options:
- Weak wire model
- No support for modeling multi-megabyte caches
CACTI 6.0 Enhancements
Incorporation of different wire models
Different router models
Grid topology for NUCA
Shared bus for UCA
Contention values for various cache configurations
Methodology to compute the optimal NUCA organization
Improved interface that enables trade-off analysis
Validation analysis
Full-swing Wires
[Figure: full-swing repeated wire schematic]
Full-swing Wires II
[Chart: delay vs. repeater size and spacing, marking three different design points at 10%, 20%, and 30% delay penalty]
Caveat: repeater sizing and spacing cannot always be controlled precisely
Full-Swing Wires
+ Fast and simple: delay proportional to sqrt(RC), as against RC
+ High bandwidth: can be pipelined
- Requires silicon area
- High energy: quadratic dependence on voltage
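The "sqrt(RC) as against RC" point can be made concrete: an unrepeated distributed RC wire has delay that grows quadratically with length, while breaking it into repeated segments makes the total grow roughly linearly. A minimal sketch, with normalized (made-up) per-mm resistance, capacitance, and repeater delay rather than real process values:

```python
# Hedged sketch: Elmore-style delay of an unrepeated wire vs. the same
# wire broken into k repeated segments. All numbers are illustrative.
def unrepeated_delay(r, c, length_mm):
    # Distributed RC delay grows quadratically with wire length.
    return 0.5 * r * c * length_mm ** 2

def repeated_delay(r, c, length_mm, segments, t_rep):
    # Each segment only sees (length/k)^2; each repeater adds t_rep.
    seg = length_mm / segments
    return segments * (0.5 * r * c * seg ** 2 + t_rep)

r, c = 1.0, 1.0  # normalized per-mm resistance and capacitance (assumed)
print(unrepeated_delay(r, c, 10))         # 50.0 — quadratic in length
print(repeated_delay(r, c, 10, 10, 0.6))  # ~11 — roughly linear in length
```

Larger or more closely spaced repeaters push delay down further at the cost of area and energy, which is the trade-off behind the 10/20/30% delay-penalty design points on the previous slide.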
Low-swing Wires
[Figure: differential low-swing signaling — a 400 mV supply drives a pair of differential wires, with a 50 mV rise on one wire and a 50 mV drop on the other]
Differential Low-swing
+ Very low power; can be routed over other modules
- Relatively slow, low bandwidth, high area requirement; requires a special transmitter and receiver
Bitlines are a form of low-swing wire
Optimized for speed and area, as against power
Driver and pre-charger employ the full Vdd voltage
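The power advantage follows from first-order dynamic-energy accounting: a full-swing transition costs C·Vdd², while a low-swing transition only moves charge C·V_swing drawn from the supply, costing roughly C·Vdd·V_swing. A sketch with illustrative numbers (1 pF of wire capacitance, 1 V supply, 100 mV swing — assumptions, not the slide's process parameters):

```python
# Hedged first-order model of why low-swing signaling saves energy.
def full_swing_energy(c_farads, vdd):
    # Full-swing transition: charge C*Vdd drawn from the supply at Vdd.
    return c_farads * vdd ** 2

def low_swing_energy(c_farads, vdd, v_swing):
    # Low-swing transition: only charge C*V_swing moves, still at Vdd.
    return c_farads * vdd * v_swing

C = 1e-12  # 1 pF of wire capacitance (assumed)
ratio = full_swing_energy(C, 1.0) / low_swing_energy(C, 1.0, 0.1)
print(ratio)  # ~10x savings for a 100 mV swing on a 1 V supply
```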
Delay Characteristics
[Chart: delay vs. wire length for the different wire types, annotated with a quadratic increase in delay]
Energy Characteristics
[Chart: energy vs. wire length for the different wire types]
Search Space of CACTI-5
[Chart: design space with global wires optimized for delay]
Search Space of CACTI-6
[Chart: design space with global and low-swing wires, marking the least-delay, 30%-delay-penalty, and low-swing design points]
CACTI – Another Limitation
Access delay is equal to the delay of the slowest sub-array: very high hit time for large caches
Employs a separate bus for each cache bank in multi-banked caches: not scalable
Exploit different wire types and network design choices to improve the search space
Potential solution – NUCA
Extend CACTI to model NUCA
Non-Uniform Cache Access (NUCA)*
A large cache is broken into a number of small banks
Employs an on-chip network for communication
Access delay is a function of the distance between the bank and the cache controller
[Figure: CPU & L1 connected to a grid of cache banks]
*(Kim et al., ASPLOS '02)
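The distance-dependent access delay can be sketched in a few lines. Assuming (for illustration only) that the cache controller sits at grid position (0, 0), that routing follows Manhattan distance, and that every router-plus-link hop costs a fixed number of cycles:

```python
# Hedged sketch of per-bank NUCA access latency on a bank grid.
# All cycle counts below are illustrative, not CACTI 6.0 output.
def nuca_latency(row, col, bank_access_cycles, hop_cycles):
    """Latency to a bank at (row, col), controller assumed at (0, 0)."""
    hops = row + col  # Manhattan distance to the cache controller
    return bank_access_cycles + hops * hop_cycles

# Nearest vs. farthest bank of a 4x4 grid: 10-cycle bank, 4 cycles/hop.
print(nuca_latency(0, 0, 10, 4))  # 10
print(nuca_latency(3, 3, 10, 4))  # 34
```

The spread between nearest and farthest banks is what dynamic-NUCA policies (later in the talk) try to exploit.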
Extension to CACTI
On-chip network: wire model based on ITRS 2005 parameters
Grid network
3-stage speculative router pipeline
Network latency vs. bank access latency trade-off: iterate over different bank sizes
Calculate the average network delay based on the number of banks and bank sizes
Consider contention values for different cache configurations
Similarly, the power consumed by each organization is also considered
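The iteration described above amounts to a small design-space search: for each candidate bank count, total access latency is the sum of bank access latency, network latency, and contention, and the organization with the lowest total wins. A sketch of that loop; the three lambda models below are made-up placeholders standing in for CACTI's wire, router, and contention models:

```python
# Hedged sketch of the CACTI 6.0-style NUCA search loop.
def explore(cache_mb, bank_counts, bank_lat, net_lat, contention):
    """Return (bank count, total cycles) minimizing total access latency."""
    best = None
    for n in bank_counts:
        # Fewer banks -> bigger, slower banks; more banks -> more
        # network hops but less contention per bank.
        total = bank_lat(cache_mb / n) + net_lat(n) + contention(n)
        if best is None or total < best[1]:
            best = (n, total)
    return best

# Illustrative latency models (not CACTI's): big banks are slow, many
# banks mean more hops; contention drops as requests spread over banks.
opt = explore(32, [2, 4, 8, 16, 32, 64],
              bank_lat=lambda mb: 10 + 4 * mb,
              net_lat=lambda n: 6 * n ** 0.5,
              contention=lambda n: 40 / n)
print(opt)  # (16, 44.5) — an intermediate bank count wins
```

This is why the trade-off curves on the next slides have a minimum at an intermediate bank count rather than at either extreme.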
Trade-off Analysis (32 MB Cache)
[Chart: latency in cycles (0–400) vs. number of banks (2–64) for a 16-core CMP, broken into total cycles, network latency, bank access latency, and network contention cycles]
Effect of Core Count
[Chart: contention cycles (0–300) vs. bank count (2–64) for 4-core, 8-core, and 16-core configurations]
Power Centric Design (32 MB Cache)
[Chart: energy in joules (0 to 1e-8) vs. bank count (2–64), broken into total, bank, and network energy, with the power-optimal point marked]
Validation
HSPICE tool
Predictive Technology Model (65 nm tech.)
Analytical model that employs PTM parameters compared against HSPICE
Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
Verified to be within 12%
Case Study: Heterogeneous D-NUCA
Dynamic NUCA: reduces access time by dynamic data movement
Nearby banks are accessed more frequently
Heterogeneous banks: nearby banks are made smaller and hence faster
Accesses to nearby banks consume less power
Other banks can be made larger and more power-efficient
Access Frequency
[Chart: percentage of requests satisfied by the first x KB of cache (0–120%), for capacities from 32 KB up to ~32 MB]
Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2 bank organizations]
Other Applications
Exposing wire properties: novel cache pipelining
Early lookup, aggressive lookup (ISCA '07)
Flit-reservation flow control (Peh et al., HPCA '00)
Novel topologies: hybrid network (ISCA '07)
Conclusion
Network parameters and contention play a critical role in deciding the NUCA organization
Wire choices have a significant impact on cache properties
CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
http://www.cs.utah.edu/~rajeev/cacti6/