Cache Improvements
James Brock, Joseph Schmigel
May 12, 2006 – Computer Architecture
Outline
Introduction
Reactive-Associative Caches
Non-Uniform Cache Architectures
Conclusion / References
Questions
Cache Problem Domains
Average memory access time = Hit Time + Miss Rate * Miss Penalty
Hit Time: the time to search the cache and return data
Miss Rate: the fraction of accesses for which the needed data is not in the cache and must be fetched from main memory
Cache Latency: the physical delay to move data from the cache to the registers
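As a quick worked example of the formula, here is a minimal sketch in C; the hit time, miss rate, and miss penalty below are made-up illustrative values, not figures from this lecture.

#include <stdio.h>

/* Average memory access time (AMAT) = hit time + miss rate * miss penalty.
 * All numbers are illustrative assumptions, not measured values. */
int main(void) {
    double hit_time     = 1.0;   /* cycles to search the cache and return data */
    double miss_rate    = 0.05;  /* fraction of accesses that miss             */
    double miss_penalty = 100.0; /* cycles to fetch the data from main memory  */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat); /* 1.0 + 0.05 * 100 = 6.0 */
    return 0;
}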
Hit Time / Miss Rate
Searching for cache hits: set-associative caches cause hit times to increase greatly, because multiple ways must be checked for a hit and then the data in the matching way must be accessed
Miss Rate: direct-mapped caches have high miss rates, and very small changes in miss rate can affect performance greatly
Latency / Mapping
Latency: cache latency is a primary reason for multi-layered, complex architectures; it is very difficult to improve due to physical limitations
Mapping: how data is mapped into the cache (associativity, physical location); better mapping heuristics can reduce the average search time and latency
Effects of Cache Changes
Power: more complex cache architectures use more power to complete tasks
Time: the more complex or larger a cache, the slower it will be
Real Estate: complexity is directly proportional to the number and length of wire traces
Hits / Misses: each change to the cache will impact the hit time and miss rate in some way
Reactive-Associative Caches
Joseph Schmigel
Reactive-Associative Caches
Attempt to combine direct-mapped and set-associative caching
Goal is to decrease the miss rate while keeping hit times similar to direct-mapped
Avoid the disadvantages of each: direct-mapped has a high miss rate; set-associative has a high hit time
Several major parts: data array, tag array, probes, way prediction, feedback
Data Array & Tag Arrays
The data array is the actual cache that stores the data
The data array has two address mappings: one direct-mapped and one set-associative (usually 2, 4, or 8 ways)
The tag array has n tag banks, where n is the number of ways; it stores the tags of each set-associative index
Each tag bank is searched in parallel
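A minimal sketch of the two address mappings over one physical data array, assuming a hypothetical geometry (256 frames, 64-byte blocks, 4 ways); the sizes are illustrative, not taken from the paper.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 256 frames, 64-byte blocks, 4 ways.
 * One physical array serves both views: the direct-mapped view indexes
 * all 256 frames; the set-associative view treats them as 64 sets x 4 ways. */
#define BLOCK_BITS 6
#define NUM_FRAMES 256
#define NUM_WAYS   4
#define NUM_SETS   (NUM_FRAMES / NUM_WAYS)

unsigned dm_frame(uint32_t addr) {               /* direct-mapped index */
    return (addr >> BLOCK_BITS) % NUM_FRAMES;
}

unsigned sa_frame(uint32_t addr, unsigned way) { /* set-associative index */
    unsigned set = (addr >> BLOCK_BITS) % NUM_SETS;
    return set * NUM_WAYS + way;
}

int main(void) {
    uint32_t addr = 0x12345678;
    printf("dm=%u sa(way 2)=%u\n", dm_frame(addr), sa_frame(addr, 2));
    return 0;
}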
Probes
Two probes (Probe0 and Probe1) are used to signal a hit
Probe0 performs three steps in parallel: looks for a direct-mapped hit, uses the predicted way to find a hit, and searches the tag array
Probe0 tries to keep the hit time equal to a direct-mapped hit time; it only fails when it has to fall back on the tag array
Probe1 is only used if Probe0 does not find a direct-mapped or way-predicted hit; it then returns a hit if there is a correct match in the set-associative cache
Probes continued
This means that the following possibilities exist:
Probe0 hits on the direct-mapped location and Probe1 is ignored
Probe0 hits on the way prediction and Probe1 is ignored
Probe0 hits using the tag array and Probe1 hits using the way found from the tag array
Probe0 misses and Probe1 is ignored
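These four outcomes reduce to a small decision function. A minimal sketch, with the three Probe0 checks abstracted as flags (in hardware they are evaluated in parallel):

#include <stdbool.h>
#include <stdio.h>

typedef enum { HIT_DM, HIT_PREDICTED, HIT_VIA_TAG_ARRAY, MISS } ra_result;

ra_result ra_lookup(bool dm_match, bool pred_match, bool tag_match) {
    if (dm_match)   return HIT_DM;            /* Probe0 hit; Probe1 ignored */
    if (pred_match) return HIT_PREDICTED;     /* Probe0 hit; Probe1 ignored */
    if (tag_match)  return HIT_VIA_TAG_ARRAY; /* Probe1 fetches from the way
                                                 found in the tag array    */
    return MISS;                              /* Probe1 ignored            */
}

int main(void) {
    printf("%d\n", ra_lookup(false, true, false)); /* 1: way-predicted hit */
    return 0;
}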
Way Prediction
Allows the block to be accessed without performing a tag lookup to obtain the way
Keeps hit times comparable to direct-mapped
Must be performed early enough that the data can be ready in time for the pipeline stage that needs it
The prediction can only use information that is currently available in the pipeline
Two types of way prediction were used: XOR and program counter
XOR Way Prediction
Calculates the approximate data address by XOR'ing the register value with the instruction offset
Works by assuming that the small memory offsets that are common can be XOR'ed with the base register to yield a reliable block address to use as a prediction
Cannot be done until late in the pipeline, because the register must be loaded before performing the calculation
More accurate than program counter way prediction
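A minimal sketch of the XOR trick, with assumed field widths: for small offsets, base ^ offset usually matches base + offset in the index bits while being cheaper than a full add.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6
#define INDEX_MASK 0xFF   /* hypothetical 256-entry prediction index */

/* XOR approximation: base ^ offset stands in for base + offset. */
unsigned xor_predict_index(uint32_t base_reg, int32_t offset) {
    return ((base_reg ^ (uint32_t)offset) >> BLOCK_BITS) & INDEX_MASK;
}

int main(void) {
    uint32_t base = 0x0001F340;
    int32_t  off  = 8; /* small offsets rarely disturb the index bits */
    unsigned approx = xor_predict_index(base, off);
    unsigned exact  = (((base + (uint32_t)off) >> BLOCK_BITS) & INDEX_MASK);
    printf("approx=%u exact=%u\n", approx, exact); /* usually equal */
    return 0;
}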
Program Counter Way Prediction
Associates parts of the cache with the program counter
Not as accurate as XOR, since the same program counter does not always access the same memory location
The program counter is available early in the pipeline, so it is easier to make the prediction in time
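A minimal sketch of a PC-indexed way predictor; the table size and the last-way update policy are assumptions for illustration, not details from the paper.

#include <stdint.h>

/* Hypothetical PC-indexed way predictor: the low bits of a load's PC
 * select a table entry holding the way that instruction last hit in.
 * The PC is known at fetch, so the prediction is available early, but
 * one PC may touch many addresses, which limits accuracy. */
#define PRED_ENTRIES 1024
static uint8_t way_table[PRED_ENTRIES];

unsigned pc_predict_way(uint32_t pc) {
    return way_table[(pc >> 2) % PRED_ENTRIES]; /* word-aligned PCs */
}

void pc_train_way(uint32_t pc, unsigned actual_way) {
    way_table[(pc >> 2) % PRED_ENTRIES] = (uint8_t)actual_way;
}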
Feedback
Three types of feedback: reactive displacement, eviction of unpredictable blocks, and eviction of hard-to-predict blocks
Feedback tries to maximize bandwidth and minimize hit latency: highly predictable blocks are used in the set-associative cache, while blocks that cannot be predicted reliably are kept in the direct-mapped cache
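One plausible way to drive these decisions, sketched with a 2-bit saturating confidence counter; the counter and the threshold are illustrative assumptions, not the exact mechanism from the paper.

#include <stdbool.h>
#include <stdint.h>

/* Sketch: track how predictable each block's way is, and let feedback
 * choose its placement accordingly. */
typedef struct {
    uint8_t pred_score; /* 0..3; higher = way prediction keeps succeeding */
} block_meta;

void record_prediction(block_meta *b, bool predicted_correctly) {
    if (predicted_correctly) { if (b->pred_score < 3) b->pred_score++; }
    else if (b->pred_score > 0) b->pred_score--;
}

/* Predictable blocks go to the set-associative mapping; unpredictable
 * blocks stay direct-mapped so their hits never wait on the predictor. */
bool place_set_associative(const block_meta *b) {
    return b->pred_score >= 2;
}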
Non-Uniform Cache Architectures
James Brock
Cache Organization
Multiple-layer cache: hierarchical organization designed for faster accesses to layers of cache closer to the core
Replacement policies are static, i.e., replacements cause one insertion and one eviction at the same location in the cache
Uniform cache: the cache architecture is physically laid out in uniformly distributed banks and sub-banks
AMD64 Cache Design (K8 Core)
Problem Domain
CPUs are becoming wire-delay dominated: as the core speed of CPUs increases, the latency of transmission delays has a greater effect on overall performance
Two possible paths: reduce the latency of wire traces (physical limitations), or use the latency in the design and optimize around it
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)
All designs were modeled for the L2 cache, but can be scaled to work at any layer
A uniform cache's latency is only as fast as its slowest bank
A non-uniform cache instead exposes the varying latency of its (sub)banks, so banks closer to the decoder deliver data faster
S-NUCA
Static means that data in main memory is mapped to a fixed set of 1 … n locations in the cache, where n = associativity
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)
[Figure: S-NUCA-1 and S-NUCA-2 bank layouts]
Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)
S-NUCA-1
Individual data and address channels for each bank
Multiple banks can be accessed in parallel
Huge real-estate cost to add channels for each bank
S-NUCA-2
Mesh grid of data and address channels; switches at each intersection access multiple sub-banks in parallel and arbitrate data flow
Solution 2: Dynamic NUCA (D-NUCA)
Dynamic refers to the ranking and movement of cache lines within the banks and sub-banks
The replacement policy is not a simple insert & evict: insertion, demotion, and eviction are driven by the replacement heuristic (for example, least recently used)
With D-NUCA, the mapping, searching, and line-movement problems expand
Suggested Mappings
D-NUCA Mapping & Searching
Uses spread sets of banks: the number of banks in a set = the associativity of the cache
Simple Mapping
Search by set, then bank, then tags within the set (see the sketch below)
Some sets are farther away than others, and rows may not hold the desired number of ways
Fair Mapping
Fixes the problems of simple mapping, but is more complex
Gives equal average access times to all bank sets
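A minimal sketch of simple mapping, assuming a hypothetical geometry of 8 bank sets with 4 banks (ways) each and 64-byte lines:

#include <stdint.h>

/* Simple mapping sketch: one bank per way, and the block address picks
 * its bank set directly. The geometry is assumed for illustration. */
#define LINE_BITS 6
#define BANK_SETS 8
#define WAYS      4  /* banks per bank set = associativity */

unsigned bank_set_of(uint32_t addr) {
    return (addr >> LINE_BITS) % BANK_SETS;
}

/* Within a bank set, bank 0 sits closest to the cache controller and
 * bank WAYS-1 farthest, so each way has a different access latency. */
unsigned bank_of(unsigned bank_set, unsigned way) {
    return bank_set * WAYS + way;
}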
D-NUCA Mapping & Searching
Shared Mapping
The closest banks are shared with the farthest sets
If n sets share a bank, then all banks in the cache are n-way associative
The slightly higher bank associativity offsets the average access latency
Cache lines from farther bank sets can thus be located right next to the cache controller
D-NUCA Mapping & Searching
Locating cache lines:
Incremental Search: one bank at a time; low power and fewer messages on the cache network, but lower performance
Multicast Search: some or all banks at the same time; more power and more network contention, but faster hits to farther banks
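A behavioral sketch of the two policies over a toy 4-bank set, counting probes as a rough proxy for power and network messages; the tag store and geometry are assumptions.

#include <stdbool.h>
#include <stdio.h>

#define WAYS 4
/* Toy tag store: one tag per bank in a single bank set; bank 0 is
 * closest to the cache controller. Values are arbitrary. */
static unsigned tags[WAYS] = { 7, 13, 21, 42 };

static bool bank_has(unsigned way, unsigned tag) { return tags[way] == tag; }

/* Incremental search: probe the closest bank first and stop on a hit.
 * Fewer messages on the cache network, but a far hit is found late. */
int incremental_search(unsigned tag, int *probes) {
    for (unsigned w = 0; w < WAYS; w++) {
        (*probes)++;
        if (bank_has(w, tag)) return (int)w;
    }
    return -1;
}

/* Multicast search: probe every bank at once (modeled sequentially).
 * More power and contention, but far hits return as soon as possible. */
int multicast_search(unsigned tag, int *probes) {
    int hit = -1;
    for (unsigned w = 0; w < WAYS; w++) {
        (*probes)++;                /* all banks are probed in parallel */
        if (bank_has(w, tag)) hit = (int)w;
    }
    return hit;
}

int main(void) {
    int p_inc = 0, p_multi = 0;
    printf("incremental: way %d, %d probes\n",
           incremental_search(13, &p_inc), p_inc);   /* way 1, 2 probes */
    printf("multicast:   way %d, %d probes\n",
           multicast_search(13, &p_multi), p_multi); /* way 1, 4 probes */
    return 0;
}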
D-NUCA Mapping & Searching
Hybrid Searches: combinations of the two
Limited Multicast: multicast to M of the N banks in each bank set in parallel, where M < N
Partitioned Multicast: similar to multi-level set-associative caches; each bank set is broken up into subsets, and multicast searches are performed on each subset, starting with the closest subset
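Continuing the toy bank set from the previous sketch (WAYS, bank_has), a sketch of partitioned multicast; the subset size M is an assumed parameter, and limited multicast corresponds to probing only the first M banks.

/* Partitioned multicast: split the bank set into subsets of M banks and
 * multicast within one subset at a time, closest subset first. */
#define SUBSET 2  /* assumed subset size M */

int partitioned_multicast(unsigned tag, int *probes) {
    for (unsigned start = 0; start < WAYS; start += SUBSET) {
        int hit = -1;
        for (unsigned w = start; w < start + SUBSET && w < WAYS; w++) {
            (*probes)++;            /* banks in a subset probed in parallel */
            if (bank_has(w, tag)) hit = (int)w;
        }
        if (hit >= 0) return hit;   /* stop before probing farther subsets */
    }
    return -1;
}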
D-NUCA Line Movement
The goal of D-NUCA is to maximize hits in the closest banks
An LRU policy is applied to mapping lines within a bank set: the MRU lines are closest to the cache controller
Replacement Policy: Generational Promotion
A cache hit causes that line to be moved one bank closer to the cache controller
D-NUCA Line Movement
Generational Promotion (cont'd)
More heavily used lines thus migrate towards the cache controller
The eviction/insertion policy shouldn't simply eject the LRU line and insert the new line in that spot
New lines are inserted towards the middle of the bank set and allowed to progress forward or back
The victim line can be evicted or simply demoted, with a less important line being evicted instead
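A minimal sketch of these movement rules within a single bank set; the geometry, the middle insertion point, and the swap-based demotion are assumptions beyond what the slides state.

#include <stdio.h>

#define WAYS 4
/* One bank set: index 0 is the bank closest to the cache controller. */
static int set_lines[WAYS] = { 11, 22, 33, 44 };

/* Generational promotion: on a hit, swap the line one bank closer, so
 * heavily used lines migrate toward the controller over repeated hits. */
void promote_on_hit(int way) {
    if (way > 0) {
        int tmp = set_lines[way - 1];
        set_lines[way - 1] = set_lines[way];
        set_lines[way] = tmp;             /* the old neighbor is demoted */
    }
}

/* Insert in the middle of the bank set; the new line can then be
 * promoted on hits or drift back and eventually be evicted if unused.
 * A real policy might demote the victim instead of evicting it. */
void insert_line(int line) {
    set_lines[WAYS / 2] = line;
}

int main(void) {
    promote_on_hit(3);   /* line 44 swaps one bank closer          */
    insert_line(55);     /* replaces whatever sits at the midpoint */
    for (int w = 0; w < WAYS; w++) printf("%d ", set_lines[w]);
    printf("\n");        /* prints: 11 22 55 33 */
    return 0;
}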
Performance Improvement
Conclusion
Cache improvements are often more work than the benefits they offer
Complexity causes a speed decrease, which limits usefulness
Implementing complex caching structures does not usually provide a good cost/benefit ratio for companies
Research is still being done, and these designs remain useful in the theoretical world
References
[1] Changkyu Kim, Doug Burger, and Stephen Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. Computer Architecture and Technology Laboratory, University of Texas at Austin.
[2] http://en.wikipedia.org/wiki/CPU_cache
[3] B. Batson and T. N. Vijaykumar. Reactive-associative caches. In Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.
[4] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003. Third Edition, Chapter Five.
[5] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proceedings of the Second IEEE Symposium on High-Performance Computer Architecture, Feb. 1996.
[6] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
[7] A. Agarwal and S. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.
Questions/Comments