
0018-9340 (c) 2013 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TC.2014.2360518, IEEE Transactions on Computers


Size-Aware Cache Management for Compressed Cache Architectures

Seungcheol Baek, Student Member, IEEE, Hyung Gyu Lee, Member, IEEE, Chrysostomos Nicopoulos, Member, IEEE, Junghee Lee, Member, IEEE, and Jongman Kim, Member, IEEE

Abstract—A practical way to increase the effective capacity of a microprocessor’s cache, without physically increasing the cache size, is to employ data compression. Last-Level Caches (LLC) are particularly amenable to such compression schemes, since the primary purpose of the LLC is to minimize the miss rate, i.e., it directly benefits from a larger logical capacity. In compressed LLCs, the cacheline size varies depending on the achieved compression ratio. Our observations indicate that this size information gives useful hints when managing the cache (e.g., when selecting a victim), which can lead to increased cache performance. However, there are currently no replacement policies tailored to compressed LLCs; existing techniques focus primarily on locality information. This article introduces the concept of size-aware cache management as a way to maximize the performance of compressed caches. Upon analyzing the benefits of considering size information in the management of compressed caches, we propose a novel mechanism – called Effective Capacity Maximizer (ECM) – to further enhance the performance and energy consumption of compressed LLCs. The proposed technique revolves around four fundamental principles: ECM Insertion (ECM-I), ECM Promotion (ECM-P), ECM Eviction Scheduling (ECM-ES), and ECM Replacement (ECM-R). Extensive simulations with memory traces from real applications running on a full-system simulator demonstrate significant improvements compared to compressed cache schemes employing conventional locality-aware cache replacement policies. Specifically, our ECM shows an average effective capacity increase of 18.4% over the Least-Recently Used (LRU) policy, and 23.9% over the Dynamic Re-Reference Interval Prediction (DRRIP) [1] scheme. This translates into average system performance improvements of 7.2% over LRU and 4.2% over DRRIP. Moreover, the average energy consumption is also reduced by 5.9% over LRU and 3.8% over DRRIP.

Index Terms—Cache, Compression, Data Compression, Cache Compression, Cache Replacement Policy

1 INTRODUCTION

The Chip Multi-Processor (CMP) paradigm has cemented itself as the archetypal philosophy of future microprocessor design. Rapidly diminishing technology feature sizes have enabled the integration of ever-increasing numbers of processing cores on a single chip die. This abundance of processing power has magnified the venerable processor-memory performance gap, which is known as the “memory wall.” The logic-memory chasm is a limiting factor in overall system performance, thus necessitating the presence of a high-performance Last-Level Cache (LLC) as a mitigating factor. Indeed, a well-designed LLC is one of the most effective ways to hide the off-chip memory latency. Burgeoning transistor integration densities [2] have allowed architects to respond to this challenge by dedicating increasing portions of the CPU’s real estate to the LLC. Such large LLCs are becoming common in the workstation and server segments. Even though modern architectures provide large LLCs to improve system performance, the size is not large enough to completely hide slow off-chip memory latencies, because the working-set size of most applications tends to increase over time, as well. Thus, one of the most important research challenges so far in the field of microprocessor design is how to effectively and efficiently utilize a given cache size.

Employing data compression within the LLC is one of the most attractive solutions to cope with increasingly larger working sets, because storing compressed data in a cache increases the effective (logical) cache capacity¹, without physically increasing the cache size. Accordingly, this increased effective cache capacity can hold a larger working set and thereby improve the system performance significantly. This benefit has led researchers to develop various compressed LLC architectures by designing efficient compression algorithms [3]–[8], or by architecting compression-aware cache structures to ease the allocation and management of variable-sized compressed cachelines [4], [8]–[14]. Most of the prior efforts, however, focused on minimizing the inherent deficiencies of compression-based schemes, such as compression/decompression latency, address remapping, and compaction² overhead. While these studies have led to significant improvements and have enabled the efficient use of compressed LLCs, no prior work has developed a cache management policy (mainly cacheline replacement policy) tailored to compressed caches. In other words, existing cache management policies for compressed caches employ the traditional cacheline replacement policies, whereby only locality information is considered. However, our experiments in this work will clearly show that the conventional cache replacement policies suffer from a fundamental weakness when applied to compressed caches: they fail to account for the cachelines’ variable size, and, thus, they do not fully utilize the benefits of compressed caching.

In sharp contrast to conventional cache architectures – where the cacheline size is constant – the physical size of a stored cacheline after compression varies according to the achieved compression ratio. This means that the eviction overhead (miss penalty) in a compressed LLC will also vary, based on the size of the evicted and evictee cachelines. This work will demonstrate that if the size information of the compressed cacheline is considered in the cache management process, the increase in the effective capacity of the compressed LLC will be maximized, while the eviction cost will be minimized. Section 3 will present a detailed motivational example to illustrate the benefits of size awareness. Hence, in addition to locality information, the size information of the cachelines should also be one of the prime determinants in managing the cachelines in compressed caches and in identifying appropriate eviction victims. In order to maximize the effective cache capacity and minimize the eviction overhead under compression, the two salient properties of size and locality must be appropriately combined, since they are inherently intertwined.

Seungcheol Baek and Jongman Kim are with the Georgia Institute of Technology, USA. Hyung Gyu Lee is with Daegu University, South Korea. Chrysostomos Nicopoulos is with the University of Cyprus, Cyprus. Junghee Lee is with the University of Texas at San Antonio, USA.

1. The term effective capacity is used to denote the effective size of the compressed cache. It is computed by multiplying the number of all valid compressed cachelines in the cache (at any given time) with the size of a single cacheline. The effective capacity changes during the execution of an application, depending on the achievable compression ratio during the execution.

2. Compaction is the process of removing fragmentation in a cache set. Note that compaction and compression are two distinct processes.

The realization of the importance of cacheline size awareness constitutes the primary motivation and driver of the presented work. This article proposes the notion of a size-aware cache management policy to allow compressed caches to maintain the maximum possible effective cache capacity at all times. The embodiment of this concept is the Effective Capacity Maximizer (ECM), which is a mechanism targeting compressed LLC architectures. The ECM architecture revolves around four fundamental and closely intertwined policies: ECM Insertion (ECM-I), ECM Promotion (ECM-P), ECM Eviction Scheduling (ECM-ES), and ECM Replacement (ECM-R). The ECM-I and ECM-P policies efficiently manage the priority of eviction when a new cacheline is inserted, or re-referenced, in the cache. Specifically, these two policies give “big-size” cachelines (i.e., cachelines that are still large after compression) higher priority to be evicted, so that the cache is occupied primarily by “small-size” cachelines (i.e., cachelines that have significantly smaller size after compression). Working in parallel, the ECM-ES and ECM-R policies select a victim – among the cachelines – that has the highest eviction priority. In particular, they aim to mostly evict big-size cachelines, in order to minimize the eviction overhead and maximize the effective capacity.

In all four policies, the size information is carefully considered in conjunction with the locality information, in order to maximize the effective capacity of the cache, while still sustaining high hit rates. All the proposed cacheline management policies comprising the ECM mechanism are very lightweight, in terms of hardware implementation, because all techniques involved mostly re-use components already implemented in the existing compressed LLC infrastructure. To the best of our knowledge, this is the first work to: (1) highlight the importance of size information in the cache management policy of compressed caches, and (2) devise a collection of cache policies that seamlessly and synergistically combine size and locality information, in order to maximize the performance of compressed caches.

To validate the efficacy and efficiency of the proposed ECM mechanism, we perform extensive simulations using memory traces extracted from real multi-threaded workloads running on a full-system simulation framework. Specifically, simulations with a 2 MB compressed LLC configured as physically 4-way set associative, and logically up to 16-way, indicate that our ECM increases the effective cache capacity in a compressed LLC by an average of 19.4% and 23.9%, as compared to the Least-Recently Used (LRU) and Dynamic Re-Reference Interval Prediction (DRRIP) [1] policies, respectively. This effective capacity improvement increases cache performance by reducing the number of misses by an average of 11.3% over LRU and 5.6% over DRRIP. As a result, the ECM technique improves overall system performance (in terms of Instructions Per Cycle, IPC) by 7.2% and 4.2%, as compared to a compressed LLC using the LRU and DRRIP replacement policies, respectively. Furthermore, detailed energy analysis indicates that ECM lowers energy consumption by 5.9% and 3.8%, respectively, over LRU and DRRIP.

The rest of the article is organized as follows: Section 2 provides background information and related work in LLC compression techniques. Section 3 presents a motivational example for the importance of cacheline size information in the cache management policy of compressed caches. Section 4 delves into the description, implementation, and analysis of the proposed ECM architecture. Section 5 describes the employed evaluation framework, while Section 6 presents the various experiments and accompanying analysis. Finally, Section 7 concludes the paper.

2 PREAMBLE & RELATED WORK

2.1 Overview of Data Compression Techniques and LLC Cache Replacement Policies

The benefits of data compression are easily observable in LLC memory architectures, where capacity is one of the most sensitive factors. As such, compression constitutes one of the most effective ways to increase the logical capacity of the memory system without increasing its physical capacity. There have been many studies on the employment of compression techniques within the LLC micro-architecture. So far, the focus has been on designing the compression-aware cache structure itself, rather than the compression algorithm. Existing approaches can be classified into two broad categories: (1) variable-segment caches, and (2) fixed-segment caches. Both design options try to optimize the cache architecture around the use of the employed compression algorithm.

The main drawback of variable-segment caches is the requirement of non-negligible space for remapping and compaction, which is pure overhead [10], [14]. Since the number of segments occupied by the compressed data is variable, depending on the compression ratio, the latter directly impacts the effective capacity. Due to this variability in occupied segments, variable-segment caches are very effective in maximizing the effective capacity of the cache. On the other hand, fixed-segment caches aim to exploit compression without any severe cacheline compaction or manipulation overhead, by fixing the number of segments occupied by the compressed cacheline (usually up to 2 or 4) [4], [8], [9], [13]. Due to the fixed segment size, if the compressed data size is very small, some segments might have empty space, which is not available to other cachelines. Our experiments indicate that this situation happens frequently under most compression algorithms – irrespective of their complexity. Most studies exploit zero-value compression [3], [7], [8], which achieves high compressibility, but still produces a lot of unusable empty space within the cache. This implies that compression algorithms do not fully utilize the physical space of the cache, even when the compression scheme is highly efficient. Since our focus is on maximizing the effective capacity, as well as reducing the cache miss penalty, we will mainly exploit the variable-segment cache architecture for the rest of this paper. However, it should be noted that the proposed ECM will also work in a fixed-segment cache architecture, because the underlying concepts are applicable to both design options.

While the cache size significantly affects system performance, the cacheline replacement policy (i.e., victim selection when cache misses occur) is also an important factor affecting overall performance. There has been extensive research in developing efficient cache management policies [1], [15]–[18]. Generally, it is accepted that the LRU replacement policy (and its approximations) behaves relatively well with most applications and access patterns (e.g., recency-friendly and sequential streaming). However, the LRU incurs performance degradation under thrashing and mixed-access patterns. To address these problems, one recent study proposed a Dynamic Re-Reference Interval Prediction (DRRIP) replacement mechanism [1]. The DRRIP comprises two cache management policies: Bimodal RRIP (BRRIP) and Static RRIP (SRRIP). The BRRIP policy specifically addresses the performance of thrashing access patterns, while the SRRIP policy aims to improve the performance of mixed-access patterns. The performance of DRRIP is primarily shaped by how well it performs under the two aforementioned cache management policies, since it relies on a Set Dueling method [17] to select the best performing among the two policies.

Despite enhancing the cache performance well beyond the levels achieved by the conventional LRU policy – with low implementation cost – these advanced cache management algorithms still focus on exploiting locality (re-reference rate) information only. While this is perfectly adequate in conventional cache architectures, in compressed cache architectures, where the size of each cacheline is variable and a function of the compression ratio, the size information should also be considered, because it directly determines the effective cache size and the miss penalty. Nevertheless, no previous work attempted to develop a cache replacement policy targeting compressed caches. To the best of our knowledge, this is the first work to propose a compressed cache architecture augmented with a customized cache management policy that is aware of both the variable cacheline size information and the locality information. It will be demonstrated that this combined locality-and-size awareness yields much improved performance results.

Fig. 1. Abstract view of one particular set of a decoupled variable-segment cache.

2.2 Basics of Decoupled Variable-Segment Cache Architectures

A high-level, abstract view of a single set of a decoupled variable-segment cache architecture [10] with a 64 B cacheline size is illustrated in Figure 1. While each cache set is physically 4-way set associative, logical 16-way set associativity can be achieved by using more tags alongside a variable-segmented data area. More specifically, each set of the cache is broken into 64 segments and the size of each segment is only 4 bytes (single-word). The effective capacity for a single set is given by:

physically 4-way ≤ effective capacity ≤ logically 16-way.

The physically 4-way term comes from the size of the Data Area, while logically 16-way is achieved through the size of the Tag Area. For instance, if there are only uncompressed cachelines, the cache would operate like a typical 4-way set associative cache. Conversely, if there are only highly compressed cachelines, the cache would operate with 16-way set associativity. Therefore, each set can potentially increase its effective capacity by up to four times (when storing 16 compressed cachelines). Of course, one could have more than 16 cachelines in a set with larger tag space. The scalability of the mechanism proposed in this work with the number of logical ways will be explored in Section 6.3. Data segments are stored contiguously in Address Tag order. The offset for the first data segment of cacheline k (in a particular set) is

segment_offset(k) = Σ_{i=0}^{k−1} actual_size(i).

The Cacheline size tag in the “Tag Area” of Figure 1 is used to record the actual physical size of each cacheline (i.e., number of segments) in each set of the compressed LLC.
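As a concrete illustration of this bookkeeping, the sketch below models one set of the decoupled variable-segment cache described above, assuming the parameters used in this paper (4-byte segments, a 64-segment data area per set, up to 16 tags, 64 B uncompressed cachelines). The type and function names (seg_tag_t, segment_offset, effective_capacity_bytes) are illustrative and not taken from the cited designs.

#include <stdint.h>

#define SEG_BYTES     4    /* one segment = 4 B                        */
#define SEGS_PER_SET  64   /* physical data area: 64 segments (256 B)  */
#define LOGICAL_WAYS  16   /* tag area holds up to 16 tags             */
#define LINE_BYTES    64   /* size of one uncompressed cacheline       */

/* Per-tag metadata kept in the tag area (illustrative layout). */
typedef struct {
    uint64_t addr_tag;
    uint8_t  valid;
    uint8_t  size_segs;   /* cacheline size tag: compressed size in segments */
} seg_tag_t;

typedef struct {
    seg_tag_t tag[LOGICAL_WAYS];   /* tags kept in Address Tag order */
} seg_set_t;

/* Data segments are stored contiguously in Address Tag order, so the
 * offset of cacheline k is the sum of the sizes of cachelines 0..k-1
 * (invalid tags contribute zero segments). */
static unsigned segment_offset(const seg_set_t *set, unsigned k)
{
    unsigned off = 0;
    for (unsigned i = 0; i < k; i++)
        if (set->tag[i].valid)
            off += set->tag[i].size_segs;
    return off;   /* offset in segments within the 64-segment data area */
}

/* Effective (logical) capacity of the set: number of valid compressed
 * lines multiplied by the size of a single (uncompressed) cacheline. */
static unsigned effective_capacity_bytes(const seg_set_t *set)
{
    unsigned lines = 0;
    for (unsigned i = 0; i < LOGICAL_WAYS; i++)
        if (set->tag[i].valid)
            lines++;
    return lines * LINE_BYTES;
}

For a set holding 16 valid compressed lines, effective_capacity_bytes returns 1024 B, i.e., four times the 256 B physical data area, matching the upper bound stated above.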

3 A SIZE-AWARE CACHE MANAGEMENT POLICY FOR COMPRESSED CACHES

3.1 The Motivation for Size Awareness

The cache replacement policies in conventional cache architectures are optimized to minimize the off-chip memory accesses by considering the data access patterns during execution. However, as previously mentioned, in a compressed cache, the size information of the cacheline is equally important and should be exploited in the optimization of both the cache structure itself and the cache replacement policy. While the cacheline size is a constant parameter in conventional caches, it becomes a performance-critical variable under data compression.

In order to quantify the ramifications of ignoring size information in compressed caches, a simple walk-through example is presented in Figure 2. This example scenario compares the behavior of a compressed cache under (a) the size-agnostic LRU replacement policy, and (b) a size-aware replacement policy. The cache is assumed to be physically configured as a 2-way set associative cache, but logically configured as an 8-way set associative decoupled variable-segment cache. When U2 – a new cacheline – is requested, a cache miss occurs. Four cachelines (C1, C3, C4, and U1) should be evicted under the LRU replacement policy. These evictions will lead to an almost four-fold increase in the miss penalty, as compared to the penalty of a conventional cache without compression. After the evictions of said 4 cachelines, C2 will also be evicted when the second U1 request arrives. In addition, a compaction process to allocate U1 should be performed in the data area, which results in non-negligible overhead in decoupled variable-segment caches. More importantly, at this moment, the Data Area contains only 2 cachelines, which is the minimum of its potential effective capacity. Even worse, the cache will experience 4 consecutive misses in the next 4 requests (C1, C3, C4, and C2). Hence, in this conventional configuration, a total of 6 cache misses and 6 cacheline evictions are observed, and the average number of cachelines that the cache set holds is 3.29. As depicted in Figure 2, the LRU replacement policy causes substantial under-utilization of the available cache space.

This undesired phenomenon may be remedied by also considering the size information of the evicted cacheline. As opposed to the previous example, we evict only U1 – which is the largest-size cacheline in the Data Area – when U2 first arrives. C1, C3, C4, and C2 can stay in the cache. Even though U2 will again be replaced by U1 at the next request for U1, the following four requests (C1, C3, C4, and C2) after U1 will hit in the cache. Thus, in total, only 2 cache misses and 2 cacheline evictions are observed, and the average number of cachelines in a cache set increases to 5.0 in this configuration. Obviously, this size-aware policy yields much better utilization of the available cache space.

This example emphatically demonstrates that it is imperative to consider the inherent variability in the cacheline sizes of compressed caches. If the size and locality information are not appropriately balanced in the cache management policy, the performance enhancements made possible by the compression scheme will not be maximized.

3.2 The Potential Conflicts Between Locality and Size Information

Having established the importance of both the locality and the cacheline size attributes in managing compressed LLCs, it is clear that the main challenge in this work is the effective simultaneous consideration of these two pieces of information, in order to maximize the benefits afforded by compression. Even though the locality and size information of a certain


Fig. 4. A high-level overview of the proposed ECM architecture. The operation of the new size-aware cache management mechanism can be decomposed into three distinct phases (grey boxes), which include the four size-aware policies introduced in this article: (1) insertion (ECM-I), (2) promotion/demotion (ECM-P), and (3) eviction (ECM-ES+ECM-R).

and the locality, but a non-negligible number of type (2) cases (high locality and big size) are observed when the cacheline size is more than 48 B in most applications. Hence, our policy aims to maximize the benefits even under these conflicting cases. Our exploration with the entire PARSEC benchmark suite [19] indicates that significant numbers of such conflicting cases are invariably present in all applications.

Evaluations with different benchmarks indicate that, for some applications, there is a clear relationship between a particular block size (or sizes) and locality [20], e.g., a specific block size may exhibit very low locality. In order to take advantage of this observation, ECM’s size classification scheme would need to operate at a finer granularity and utilize more than two sizes. Such an extension would lead to further performance optimization, but it would be more costly in terms of hardware and complexity. In this work, we rely on simple, binary size classification.

4 THE EFFECTIVE CAPACITY MAXIMIZER (ECM) MECHANISM FOR COMPRESSED LLCS

The main contribution of this work is the adoption of size-awareness into the cache management policy of compressed caches. Not only is the size information useful, it is also demonstrated to constitute a fundamental attribute in determining the overall efficacy and efficiency of a compressed LLC, as explained in the previous sections. The exploitation of cacheline size information when selecting a victim is a necessary condition to maximize the benefits of compression; namely, enhancing the effective capacity – which implies a reduction in the number of misses – and reducing the miss penalty. In this section, we propose a size- and locality-aware compressed cache management scheme, called the Effective Capacity Maximizer (ECM), to further increase the performance of compressed LLCs. ECM’s operation revolves around four policies (ECM-I, ECM-P, ECM-ES, and ECM-R), which consider size information in all pertinent aspects of cache management: insertion, promotion, eviction scheduling, and replacement. The four basic policies introduced in this article can be used in conjunction with most existing locality-aware replacement policies. However, in this paper, we partially exploit the basic framework of RRIP [1], which achieves high performance at a fairly low implementation cost, as compared to the other conventional cache management implementations. Note that the RRIP framework itself does not consider the size information of the cachelines; similar to other cache management policies, RRIP focuses on locality-awareness.

Figure 4 illustrates the fundamental components of the ECM framework, which enable the simultaneous consideration of locality and size information. The main operation of the proposed cache management mechanism can be decomposed into three distinct phases (grey boxes in Figure 4), which include the four size-aware cacheline management policies introduced in this article: (1) insertion (ECM-I), (2) promotion/demotion (ECM-P), and (3) eviction (ECM-ES+ECM-R). In each phase, the size information of the cacheline is effectively considered, together with the locality information, by markedly enhancing the original RRIP framework [1]. For example, the ECM-I and ECM-P policies adjust the insertion and promotion points, while considering the size of a newly inserted, or re-referenced, cacheline. When an eviction is necessary, the ECM-ES and ECM-R policies select an appropriate victim among the LRU cachelines, by also considering the size information. The goal is to minimize the eviction penalty while maximizing the effective capacity of the cache. Before delving into the details of each size-aware cache management policy, we present the fundamental principles of the RRIP framework [1] to help the reader understand the assumed underlying infrastructure.

4.1 The Fundamental Principles of the Re-Reference Interval Prediction (RRIP) Framework [1]

Figure 5(a) shows the conventional RRIP [1] framework, which uses M bits of meta-data for each cacheline, in order to store one of 2^M possible Re-Reference Prediction Values (RRPVs). The four possible RRPV values form an RRIP chain (one for each cacheline in the cache), as depicted in Figure 5(a). If a cacheline stores an RRPV of zero, it is predicted to be re-referenced (re-used) in the near-immediate future. On the other hand, if a cacheline stores an RRPV of 2^M − 1, the cacheline is predicted to be re-referenced in the distant future. In other words, cachelines with smaller RRPVs are expected to be re-referenced sooner than cachelines with larger RRPVs. Therefore, on a cache miss, RRIP selects a victim among the cachelines whose RRPV is 2^M − 1 (distant re-reference interval). If there is no cacheline with an RRPV of 2^M − 1, the RRPVs of all cachelines are increased by 1 – called demotion – and this increasing process is repeated until a victim cacheline is found. On a hit, the RRIP changes the RRPV of the hit cacheline to zero – called promotion – thus predicting the cacheline to be re-referenced in the near-immediate future. When new data is initially fetched from the memory device into a specific cacheline, the Static RRIP (SRRIP) policy sets the cacheline’s initial RRPV as 2^M − 2, which indicates long re-reference interval, instead of distant re-reference interval, to allow SRRIP more time to learn and improve the re-reference prediction. However, when the re-reference interval of all the cachelines is larger than the available cache, SRRIP causes cache thrashing with no hits. To address such scenarios, Bimodal RRIP (BRRIP) sets the majority of new cachelines’ initial RRPVs as 2^M − 1 (distant re-reference interval prediction), and it infrequently inserts new cachelines with RRPV of 2^M − 2 (long re-reference interval prediction). Dynamic RRIP (DRRIP) determines which policy is best suited for an application between SRRIP and BRRIP using Set Dueling [17], which is a method that simultaneously monitors the performance of both SRRIP and BRRIP and chooses the winner among the two.
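The sketch below restates this baseline RRIP bookkeeping in code, assuming a 2-bit RRPV (M = 2) and a 16-way set; the Set Dueling logic that chooses between SRRIP and BRRIP at run-time is omitted. All identifiers are illustrative and not part of the RRIP proposal itself.

#include <stdbool.h>

#define M         2                  /* RRPV width in bits                 */
#define RRPV_MAX  ((1u << M) - 1)    /* 3: distant re-reference prediction */
#define RRPV_LONG ((1u << M) - 2)    /* 2: long re-reference prediction    */
#define WAYS      16

static unsigned rrpv[WAYS];          /* one RRPV per cacheline in the set */

/* SRRIP insertion: predict a long re-reference interval. */
static void srrip_insert(unsigned way)  { rrpv[way] = RRPV_LONG; }

/* BRRIP insertion: mostly distant, infrequently long (anti-thrashing). */
static void brrip_insert(unsigned way, bool infrequent)
{
    rrpv[way] = infrequent ? RRPV_LONG : RRPV_MAX;
}

/* Promotion on a hit: predict a near-immediate re-reference interval. */
static void rrip_promote(unsigned way)  { rrpv[way] = 0; }

/* Victim selection on a miss: pick a line whose RRPV equals 2^M - 1;
 * if none exists, demote (increment) every line and try again. */
static unsigned rrip_select_victim(void)
{
    for (;;) {
        for (unsigned w = 0; w < WAYS; w++)
            if (rrpv[w] == RRPV_MAX)
                return w;
        for (unsigned w = 0; w < WAYS; w++)
            rrpv[w]++;               /* demotion step */
    }
}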

4.2 The ECM Insertion (ECM-I) Policy

The insertion policy of a cache management framework decides where to insert an incoming cacheline in the meta-data chain. Specifically, in the case of RRIP, the insertion policy decides the RRPV value to be given to new cachelines. Whereas RRIP makes this decision based solely on locality (re-reference rate) information, the proposed ECM-I policy must also account for the incoming line’s size. The main goal of the ECM-I policy is to give the big-size cachelines a higher chance of eviction, while minimizing the conflict with locality. For this purpose, we simply classify the cachelines into two categories: big-size cachelines and small-size cachelines. We then adjust the initial insertion point of the cacheline’s re-reference interval, based on this classification. The cachelines classified as big-size are allocated with a higher RRPV than the small-size cachelines, so as to force the big-size cachelines to be evicted sooner. However, if the RRPV of big-size cachelines is set to 2^M − 1 (distant re-reference interval prediction), the big-size cachelines with high locality will also be evicted sooner, due to the lack of time to learn locality information. Therefore, the RRPV of big-size cachelines is set to 2^M − 2 (long re-reference interval prediction) and that of small-size cachelines to 2^M − 3 (intermediate re-reference interval prediction), so that all the cachelines have enough time to learn locality information. Figure 5(b) shows the basic operation of the proposed ECM-I policy. Since only the initial insertion point (RRPV) in the RRIP chain must be modified, as compared to the baseline RRIP mechanism, no significant hardware alterations are necessary to support the ECM-I scheme. However, the critical issue in this methodology is how to classify a cacheline as either big-size, or small-size. This classification is very important, in order to balance the big-size cacheline thrashing effect and the re-reference interval prediction.

Fig. 5. The meta-data chains of the RRIP [1] and the proposed ECM-I policies, assuming the use of a 2-bit (M=2) RRPV. Note that each cacheline in the cache has its own RRPV value, which implies that each cacheline traverses its own chain, based on the line’s hit/miss performance.

One simple policy is to use a predefined threshold, Th (2 ≤ Th < 16, where 16 is equal to the size of a single uncompressed cacheline in number of segments, and 2 is the number of segments occupied by the smallest possible compressed cacheline). Each segment is 4 B long, and a threshold is defined in terms of a number of segments. Adjusting the threshold value translates to adjusting the balance between the locality and the size information. If the threshold is set high, the re-reference rate will be given more weight than the size information, while a lower threshold value means the exact opposite. Since there is no strong correlation between the size and re-reference interval of the cacheline, as observed in Figure 3, the threshold value that yields the best performance could only be empirically identified through sensitivity analysis. More specifically, the threshold value would be assumed to be static, and different threshold values would be explored. However, this assumption of a static threshold value would not maximize the performance in real systems, because the optimal threshold value varies with not only the application, but also the execution timeline of each application.

Hence, we devise a dynamically adjustable threshold scheme, which dynamically changes the threshold value by simultaneously considering real-time effective capacity information and physical memory usage in a set. The threshold value is updated every time a new cacheline is inserted in a set and the threshold is calculated on a per-set basis. In order to efficiently determine the dynamically changing Th, while at the same time minimizing the implementation overhead within each set, we derive the following equation:

Th = ⌈( Psize/WL + (Psize/WP) · (1 − NVC/WL) ) · (Lsize/Psize) ⌉          (1)

The parameter Psize indicates the physical size of a single set, in terms of the number of segments (one segment is equal to 4 bytes). The parameters WP and WL indicate the physical and logical number of ways in a set, respectively. These three terms are constant and they are fixed at design time. On the other hand, NVC and Lsize indicate, respectively, the number of valid cachelines in a set, and the total number of segments in all the valid cachelines in a set. These quantities change dynamically at run-time; WP ≤ NVC ≤ WL and 4 ≤ Lsize ≤ Psize. These two dynamically changing values are updated when a new cacheline is inserted. The left-most term in Equation (1), Psize/WL, indicates the average number of segments occupied by a single compressed cacheline when a set includes the maximum possible number of cachelines (i.e., WL). This term is inserted as a bias for the maximum effective capacity. Starting from this bias, the threshold is dynamically changed by (Psize/WP) · (1 − NVC/WL), thereby reflecting the average number of segments occupied by a single valid cacheline in real-time. Finally, the threshold is recalculated by the right-most term in Equation (1), Lsize/Psize, which represents the physical memory utilization of a set.

Since the parameters Psize, WP, and WL are set to 64, 4, and 16 in our evaluation, respectively, Equation (1) simplifies to:

Th = (20 − NVC) · Lsize / 64          (2)

Based on this equation, the minimum possible value of Th is the same as the bias value (i.e., 4 in our evaluation), when NVC reaches its maximum (i.e., 16). The threshold may increase up to the value of 16, as NVC decreases all the way to 4.

If NVC is equal to WL (i.e., 16), it implies that all the tags are valid. In this case, the average size of cachelines in a set is relatively small, and we need to change the threshold to a sufficiently small value. Furthermore, we need to take into account the physical memory usage, in order to change the threshold more precisely. Even if the tags are fully used in a set, the physical memory usage can range from 32 segments (128 B) to 64 segments (256 B). Therefore, we need to change the threshold not only with NVC, but also with Lsize. For example, when Lsize = 32 segments, then the threshold value will be (20 − 16) × (32/64) = 2, while for 64 segments the threshold will be (20 − 16) × (64/64) = 4, as per the equation above.

On the contrary, if NVC is small (low effective capacity) – which indicates the presence of several big-size cachelines – we need to change the threshold to a larger value. In this case, we also need to take into account the physical data area size, because the small number of valid tags does not reflect the fact that there are big-size cachelines in the data area all the time. For example, if there are 8 valid tags, then the physical data area size can range from 16 segments (64 B) to 64 segments (256 B). Therefore, based on NVC and Lsize, the threshold value will be (20 − 8) × (16/64) = 3 for 16 segments, or (20 − 8) × (64/64) = 12 for 64 segments. If there are only uncompressed lines (i.e., NVC = WP = 4 and Lsize = 64 segments), the threshold value will be (20 − 4) × (64/64) = 16. Thus, the size-aware insertion mechanism will be disabled, because no cacheline occupies more than 16 segments.

Since the NVC and Lsize parameters (1) are defined at run-time using the tag area information, (2) they have to be read, in order to check for a hit or miss when a new request arrives, and (3) they do not require any additional storage elements (the threshold is updated whenever a cacheline is inserted), this technique can be implemented in hardware fairly easily and efficiently. As previously mentioned, the dynamically adjustable threshold value is calculated on a per-set basis. To calculate the threshold in hardware, ECM-I requires appropriate logic to implement Equation (2). Given that the NVC and Lsize values are already stored in the tag area of each set, the hardware logic simply needs to perform a subtraction, a division, and a multiplication (as per Equation (2)). Note that the division operation is performed by using simple shift logic, because the divisor is a power of 2. Only one logic block evaluating Equation (2) is needed for the entire compressed cache, since the threshold calculation is performed as each cacheline is inserted into the set. Although Equation (2) is derived from heuristics, we will demonstrate that it effectively balances the classification rate of the various compressed cachelines.

In summary, the ECM-I policy classifies incoming cachelines as big-size or small-size, based on a size threshold. In our methodology, using more than two size classifications requires more than two insertion points in the RRIP chain. However, including more than two insertion points incurs higher implementation overhead, and it also places too much weight on the size information, which breaks the desired balance in the locality vs. size information. Thus, we decided to use only two classifications during the insertion, as a performance/cost tradeoff. To counter this binary classification limitation, we dynamically adjust the threshold value in real-time, based on the effective capacity of the cache. The variable threshold value captures the run-time salient characteristics of each application.
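Putting the pieces of this section together, the sketch below shows how an ECM-I implementation might compute the per-set threshold of Equation (2) and choose the initial RRPV of an incoming line, assuming the evaluated configuration (Psize = 64, WP = 4, WL = 16) and a 2-bit RRPV. The function names and the exact coding of the division are illustrative; NVC and Lsize are assumed to be read from the set's tag area, as described above.

#define P_SIZE  64u   /* physical segments per set (Psize)  */
#define W_P      4u   /* physical ways (WP)                 */
#define W_L     16u   /* logical ways, i.e., tags (WL)      */

#define RRPV_LONG    2u   /* 2^M - 2: insertion point for big-size lines   */
#define RRPV_INTERM  1u   /* 2^M - 3: insertion point for small-size lines */

/* Equation (2): Th = (20 - NVC) * Lsize / 64, the simplification of
 * Equation (1) for this configuration. The division by a power of two
 * maps to the simple shift logic mentioned in the text. */
static unsigned ecm_threshold(unsigned nvc, unsigned lsize)
{
    return ((P_SIZE / W_L + P_SIZE / W_P - nvc) * lsize) / P_SIZE;
}

/* ECM-I: a line strictly larger than the threshold is classified as
 * big-size and inserted closer to eviction; smaller lines are inserted
 * at the intermediate re-reference interval prediction. */
static unsigned ecm_insert_rrpv(unsigned line_segs, unsigned nvc, unsigned lsize)
{
    return (line_segs > ecm_threshold(nvc, lsize)) ? RRPV_LONG : RRPV_INTERM;
}

With NVC = 16 and Lsize = 32, for instance, ecm_threshold returns 2, matching the worked example given earlier in this section.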

4.3 The ECM Promotion (ECM-P) Policy

As described in the previous sub-section, the ECM-I policy considers the cacheline’s size information only once, at insertion time. However, our experiments have revealed that this initial consideration is not enough to maximize the effective capacity of the compressed cache. In the baseline RRIP framework [1], once a cacheline is inserted into the RRIP chain, the RRIP mechanism initiates promotions and/or demotions, depending on the hit/miss behavior of the line. This is the method by which RRIP manages the cacheline’s locality information efficiently. In the ECM case, both the locality and the size information must be accounted for during this promotion/demotion process. Consequently, we devise an ECM Promotion (ECM-P) policy to tackle this balancing act.

Similar to ECM-I, the goal of the ECM-P policy is to enable a higher probability of eviction of big-size cachelines, while minimizing the potential conflict with locality-aware cache management. The ECM-P policy in ECM manages the promotion process, whereby a cacheline traverses the RRIP chain. In the conventional RRIP framework [1], when a cacheline is re-referenced, its RRPV is promoted to zero (near-immediate re-reference interval prediction). On the other hand, the proposed ECM-P scheme adjusts the promotion step of the re-referenced cacheline (i.e., the change in RRPV), based on the line’s classification as big-size or small-size (recall that the classification is performed using Equation (2), as defined in Section 4.2). Figure 6 depicts the operational principle of the proposed ECM-P policy. After a cacheline hit, ECM-P first checks the size information of the hit cacheline. If the line is classified as small-size, its promotion point in the RRIP chain is the same as in the traditional RRIP framework, i.e., the cacheline’s RRPV is changed to zero. However, if the line is classified as big-size, ECM-P changes the line’s RRPV to a value higher than the promotion point of small-size cachelines. Specifically, as shown in Figure 6, the promotion point of big-size cachelines is set to an RRPV of one (intermediate re-reference interval prediction). Thus, if a big-size cacheline happens to have the same locality information as a small-size cacheline, the ECM-P process ensures that the probability of eviction of the big-size cacheline is always higher than that of the small-size cacheline.

Fig. 6. The ECM meta-data chain showing both the ECM-P policy transitions (top of diagram) and the ECM-I policy insertion points (bottom of figure), assuming a 2-bit (M=2) RRPV. The size information is considered both during insertion of the line in the cache, and after each hit (promotion process).

A similar philosophy may be applied to the demotion process. Under conventional RRIP, if a cache miss occurs and no cacheline is found with RRPV of 2^M − 1, the RRPV of all cachelines is increased by one. However, unlike the promotion process, the demotion process involves size comparisons of all the cachelines in a set with a pre-defined threshold. This implies that the determination of the demotion point while accounting for the size of the cacheline involves a non-negligible implementation overhead. Additionally, increasing the RRPV of big-size cachelines by more than 1 has the effect of putting undue weight on the size information. In fact, in some cases we observed that adjusting the demotion point of big-size cachelines may even lead to a decrease in performance, because some big-size cachelines are evicted too quickly, before they get a chance to learn locality information. Therefore, we apply the size-aware cache management concept only to the promotion process. For the demotion of a cacheline (upon a miss), a more sophisticated technique will be introduced later on, which considers size information in the eviction step without diluting the locality information.

Note that the ECM-P scheme only changes the promotion point in the RRIP chain, as compared to the size-agnostic conventional RRIP implementation. As a result, no significant alteration of the conventional framework is necessary to implement the ECM-P policy.
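A minimal sketch of the ECM-P hit path follows, assuming the same 2-bit RRPV and the big/small classification driven by Equation (2); the identifiers are illustrative.

#define RRPV_NEAR    0u   /* near-immediate re-reference prediction */
#define RRPV_INTERM  1u   /* intermediate re-reference prediction   */

/* ECM-P: on a hit, a small-size line is promoted to RRPV 0 exactly as in
 * the baseline RRIP, while a big-size line is only promoted to RRPV 1, so
 * that at equal locality the big-size line always stays closer to eviction.
 * Demotion is left untouched: as in baseline RRIP, all RRPVs are simply
 * incremented when no line sits at the distant re-reference value. */
static unsigned ecm_promote_rrpv(unsigned line_segs, unsigned threshold)
{
    return (line_segs > threshold) ? RRPV_INTERM : RRPV_NEAR;
}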

4.4 The Size-Aware Victim Selection Process

The ECM-I and ECM-P policies (two of the four fundamental policies of the proposed ECM architecture, as illustrated in Figure 4) focus on adjusting the insertion and promotion points of the various cachelines – while trying to balance the locality and size information. This sub-section focuses on ECM’s third main component, i.e., the victim selection (and subsequent eviction) among a pool of candidate victims.

Fig. 7. The RRPV values of the cachelines of one set in an 8-way set-associative cache. Each rectangle (‘a’ to ‘h’) corresponds to one way – i.e., one cacheline – of the set. The cachelines are ordered based on their RRPV values. A 2-bit (M=2) RRPV is assumed here.

4.4.1 The ECM Replacement (ECM-R) Policy

Figure 7 shows the RRPV values of the cachelines of one set in an 8-way set-associative cache. The letters ‘a’ to ‘h’ correspond to the eight ways of the set (i.e., each rectangle represents one cacheline of the set). The cachelines are ordered based on their RRPV values. The two right-most lines with RRPV of zero are at the “RRIP head,” whereas the three left-most lines have RRPV of three, and are, therefore, at the “RRIP tail.” In the baseline RRIP framework [1], the victim is selected among the cachelines whose RRPV is 2^M − 1 (distant re-reference interval). These lines constitute what is referred to as the eviction pool (the three left-most lines in Figure 7). Conventional methods tend to select the left-most victim (as shown in the figure), or a random victim, if there is more than one cacheline in the eviction pool. This is a reasonable choice, because all the cachelines in the eviction pool have already been “studied” in terms of their re-reference interval, and they have been predicted to be re-used in the distant future. This reasoning, however, only captures the locality information.

policies, the ECM mechanism must account for both the sizeand locality information during the victim selection process aswell, i.e., the size information must be utilized by the evictionpolicy. As previously explained, the RRPV of the cachelinesin the eviction pool indicates distant re-reference intervals.Thus, we may consider the size information alone withoutaffecting the locality. In other words, selecting the biggest-sizecacheline from the eviction pool as the victim does not conflictwith the locality information. This is precisely the premiseof the proposed ECM Replacement (ECM-R) policy, theoperation of which is described in pseudo-code in Algorithm1. Even though this is a very straightforward methodology,it is very effective in compensating for the weakness of thecoarse-grained (binary) classification of the size informationperformed under the ECM-I policy. For example, let us assumethat there are two cachelines having the same re-referenceinterval; one is quite bigger than the other, but due to thecoarse-grained classification, they were classified under thesame type of cacheline (either big-size, or small-size). Thismeans that their probability of being selected as victims isalso the same, even though their actual sizes are not the same.Under the proposed scheme, the bigger-size cacheline willbe evicted earlier than the smaller one, if both cachelinesfind themselves in the eviction pool. In general, since thesize information is applied only to the eviction pool, potentialconflicts with the locality information are avoided.The process of evicting the largest-size cacheline first has

a two-fold advantage: (1) future cachelines have as much

Algorithm 1 The operation of the ECM-R policy, assuming an M-bit RRPV.

/* finding a victim */
while new cacheline size > vacant data segment size do
    if there is a cacheline whose RRPV = 2^M - 1 then
        find the biggest-size cacheline whose RRPV = 2^M - 1
        victim ← biggest-size cacheline
        vacant data segment size ← vacant data segment size + biggest-size cacheline size
    else
        for all the cachelines in the set do
            if RRPV ≠ 2^M - 1 then
                RRPV ← RRPV + 1
            end if
        end for
    end if
end while

Algorithm 2 The operation of the ECM-ES policy, assuming an M-bit RRPV.

/* forming a new, size-guaranteed eviction pool */
while new cacheline size > eviction pool size do
    for all the cachelines in the set do
        if RRPV ≠ 2^M - 1 then
            RRPV ← RRPV + 1
        end if
    end for
end while

/* finding a victim from the size-guaranteed eviction pool */
while new cacheline size > vacant data segment size do
    /* reduced ECM-R policy */
    find the biggest-size cacheline whose RRPV = 2^M - 1
    victim ← biggest-size cacheline
    vacant data segment size ← vacant data segment size + biggest-size cacheline size
end while

The process of evicting the largest-size cacheline first has a two-fold advantage: (1) future cachelines have as much space as possible in the cache, and (2) the eviction penalty is reduced. The latter is a consequence of the fact that ECM-R minimizes the number of evicted cachelines. Without ECM-R, two or more cachelines would have to be evicted from the cache to the main memory if the size of the incoming cacheline happens to be larger than the victim cacheline (as described in Section 3.1). This, in turn, implies that the write-back buffer may stall more frequently. Instead, the ECM-R policy minimizes the number of evicted cachelines by simply selecting the biggest cacheline among the victim candidates in the eviction pool.

The ECM-R mechanism selects the biggest-size cacheline as a victim from the eviction pool, by using the 4-bit cacheline-size information tag already present in the tag area. The ECM-R scheme requires 15 4-bit comparators, in order to identify the biggest-size cacheline. The same comparators are also used by the ECM-I mechanism, in order to classify cachelines as big-size or small-size. This is, in fact, the only overhead incurred. Since the proposed ECM mechanism is intended to be used on top of the RRIP framework [1] and a decoupled variable-segment cache architecture [10], all pertinent structures are already in place. Other than the 15 comparators, no significant alteration or additional storage is required to implement the ECM-R policy.
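As an illustration, the ECM-R victim search of Algorithm 1 can be sketched behaviorally as follows. The function name, the segment-based size encoding, and the example values are illustrative assumptions, not the hardware implementation.

# Behavioral sketch of the ECM-R victim search (Algorithm 1): among the lines
# in the eviction pool (RRPV = 2^M - 1), evict the largest first, and continue
# until the vacant segments can hold the incoming line.
M = 3
RRPV_MAX = (1 << M) - 1

def ecm_r_select_victims(rrpv, size, new_size, vacant):
    """Returns the list of ways evicted to make room for `new_size` segments."""
    victims = []
    while new_size > vacant:
        pool = [w for w, v in enumerate(rrpv) if v == RRPV_MAX and w not in victims]
        if pool:
            # Biggest-size line in the eviction pool goes first
            # (a comparator tree in hardware).
            victim = max(pool, key=lambda w: size[w])
            victims.append(victim)
            vacant += size[victim]
        else:
            # Demote: age every remaining line until a candidate appears.
            for w in range(len(rrpv)):
                if rrpv[w] != RRPV_MAX:
                    rrpv[w] += 1
    return victims

# Example: a 10-segment line arrives and only 3 segments are vacant.
print(ecm_r_select_victims([7, 7, 7, 5, 4, 2, 0, 0],
                           [4, 9, 3, 6, 2, 8, 5, 2],
                           new_size=10, vacant=3))   # -> [1] (the 9-segment line)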

4.4.2 The ECM Eviction Scheduling (ECM-ES) Policy

In most cases, the proposed ECM-R policy is very effective at minimizing the miss penalty. However, if the size of the new incoming cacheline happens to be larger than the combined size of all the cachelines in the eviction pool (i.e., larger than the sum of the cachelines' sizes), the ECM-R policy is not as effective.


Not enough room is created for the new cacheline waiting to be inserted, even after all the cachelines in the eviction pool are evicted. For example, in Figure 7, if the sum of the sizes of the cachelines in the eviction pool – ‘a,’ ‘b,’ and ‘c’ – is smaller than the size of the incoming cacheline, then cacheline ‘d’ must also be evicted after the demotion process. In order to tackle these problematic cases, we propose a modified replacement policy, called ECM Eviction Scheduling (ECM-ES). The main objective of the ECM-ES policy is to minimize the miss penalty upon evictions, by considering the combined size of the entire eviction pool.

Using the same example of Figure 7, if the size of cacheline ‘d’ is equal to, or greater than, the size of the incoming cacheline, then evicting cacheline ‘d’ (which is not part of the current eviction pool) may be a better option than evicting all the cachelines in the eviction pool – ‘a,’ ‘b,’ and ‘c’ – and then ‘d.’ Based on this observation, the ECM-ES policy extends the range of the eviction pool by performing pre-demote operations until the sum of the cacheline sizes of this extended eviction pool is larger than the size of the new cacheline waiting to be inserted. The expanded eviction pool is called a size-guaranteed eviction pool. The detailed operation of the ECM-ES policy (generating a size-guaranteed eviction pool and selecting a victim) is described in Algorithm 2. The ECM-ES policy first compares the combined size of the cachelines in the eviction pool to the new cacheline's size. If the size of the new cacheline is smaller than the combined size of the cachelines in the eviction pool, then the regular ECM-R policy is used. In the opposite case, the ECM-ES policy is used to create a size-guaranteed eviction pool by increasing the RRPVs of all the cachelines in the set, until the combined size of the cachelines in the eviction pool is larger than the size of the new cacheline. A victim is then selected from the size-guaranteed eviction pool using a “reduced” ECM-R policy (second part of Algorithm 2). The ECM-ES policy effectively minimizes the miss penalty on a cache miss by reducing the number of evicted cachelines.
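The ECM-ES flow of Algorithm 2 can be sketched in the same behavioral style: a size-guaranteed eviction pool is formed first by pre-demoting lines, and the reduced ECM-R step then evicts the largest candidates. The example input reproduces the case where line ‘d’ alone covers the incoming line; all names and values are illustrative.

# Behavioral sketch of ECM-ES (Algorithm 2): grow a size-guaranteed eviction
# pool, then apply the reduced ECM-R step within it.
M = 3
RRPV_MAX = (1 << M) - 1

def ecm_es_select_victims(rrpv, size, new_size, vacant):
    # Phase 1: form the size-guaranteed eviction pool by pre-demoting lines.
    def pool():
        return [w for w, v in enumerate(rrpv) if v == RRPV_MAX]
    while new_size > sum(size[w] for w in pool()):
        for w in range(len(rrpv)):
            if rrpv[w] != RRPV_MAX:
                rrpv[w] += 1
    # Phase 2: reduced ECM-R, evict the biggest lines until there is room.
    victims = []
    while new_size > vacant:
        candidates = [w for w in pool() if w not in victims]
        victim = max(candidates, key=lambda w: size[w])
        victims.append(victim)
        vacant += size[victim]
    return victims

# Example where the original pool ('a', 'b', 'c') is too small for the new line:
print(ecm_es_select_victims([7, 7, 7, 6, 4, 2, 0, 0],
                            [2, 3, 2, 12, 4, 8, 5, 2],
                            new_size=10, vacant=0))   # -> [3] (line 'd' alone suffices)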

5 EXPERIMENTAL METHODOLOGY

5.1 Simulation Framework

A trace-driven simulator has been developed to evaluate the proposed ECM mechanism. Since the memory access sequence and the compression ratio are the most important factors for this evaluation, all traces have been extracted from the Simics full-system simulator [21], extended with GEMS [22], while simulating a quad-core processor with a two-level cache hierarchy. Ten multi-threaded benchmarks from the PARSEC benchmark suite [19] are selected, and each benchmark runs for 300 million instructions to report the Instructions-Per-Cycle (IPC) metric. This methodology is similar to that in [1]. The L1 caches use an LRU replacement policy, and our study focuses on the cache management of the LLC. We set the main memory access latency to 250 CPU clock cycles. All the details of the simulation parameters are described in Table 1. In addition to the performance evaluation, we also perform energy consumption simulations by integrating power consumption specifications into our trace-driven simulator. The energy parameters of the SRAM-based LLC and the DRAM-based main memory system are obtained from CACTI [23] and a commercial DRAM data sheet [24], respectively.
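The sketch below is only a rough, hypothetical illustration of how per-event energy parameters can be folded into a trace-driven simulator of this kind: per-event energies are multiplied by event counts gathered while replaying the trace. The stand-in cache model, the event names, and the energy values are placeholders and do not reflect the actual simulator or the CACTI/datasheet figures used in this work.

# Hypothetical sketch of energy accounting inside a trace-driven simulator.
ENERGY_PER_EVENT_NJ = {"sram_access": 0.5, "dram_read": 15.0,
                       "dram_write": 18.0, "compress": 0.2}   # placeholder values

def replay(trace, capacity_lines=32768):
    """trace: iterable of (line_address, is_write). Returns event counts and energy (nJ)."""
    resident = {}                      # crude stand-in LLC model: address -> dirty flag
    counts = dict.fromkeys(ENERGY_PER_EVENT_NJ, 0)
    for addr, is_write in trace:
        counts["sram_access"] += 1
        counts["compress"] += 1        # every line is (de)compressed at the LLC
        if addr not in resident:
            counts["dram_read"] += 1   # miss: fetch the line from main memory
            if len(resident) >= capacity_lines:
                victim, dirty = resident.popitem()
                if dirty:
                    counts["dram_write"] += 1   # dirty write-back
            resident[addr] = False
        resident[addr] = resident[addr] or is_write
    energy = sum(counts[e] * ENERGY_PER_EVENT_NJ[e] for e in counts)
    return counts, energy              # leakage terms would be added per cycle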

5.2 The Compression Technique Employed in the LLC

This paper assumes that the targeted baseline architecture uses compression only in the LLC. No compression is assumed in the L1 caches, nor in main memory.

As a compression algorithm for the LLC, we choose Frequent Pattern Compression (FPC) [3] – a bit-level compression algorithm – because its compression performance is relatively high, with reasonable compression overhead, both in terms of delay and implementation cost. The energy parameters of the FPC compression algorithm were obtained from [13]. We apply FPC within a word, so a compressed cacheline's size varies from 8 B (maximally compressed) to 64 B (uncompressed). A maximally compressed cacheline occupies only 2 segments, while an uncompressed cacheline occupies 16 segments in the decoupled variable-segment LLC, as illustrated in Figure 1. A 2 MB, physically 4-way LLC is used. We set the maximum number of ways in a set to 16 (logical ways), so the effective capacity can increase up to 8 MB, which is a significant improvement. Without loss of generality, these parameters help demonstrate the efficiency of the proposed ECM mechanism.

A critical operation in compressed LLCs is data compaction, which is required when the cache has room for upcoming requests (read misses, write hits, and write misses), but not in consecutive segments. In the case of a read miss, we perform the compaction while the data is being fetched from main memory, without knowing the size of the fetched cacheline. If the remaining space in the LLC after compaction is still not enough to hold the fetched data, we perform an eviction. The fetched data is immediately passed to the upper-level cache, while (1) the eviction of the old cacheline, and (2) the storage of the new cacheline are performed in the background, until the next request arrives. This implies that no additional delay is incurred due to compaction on a read miss. However, for both write hits and misses, the cacheline size changes and compaction is necessary for most requests, accompanied by a compaction overhead. In the cases of write hits and write misses, we consider a 16-cycle compaction overhead and two additional 20-cycle L2 response latencies (20-cycle read latency + 16-cycle compaction latency + 20-cycle write latency). Without completing a compaction, data cannot be written into the cache. Thus, our evaluation accurately accounts for the compaction overhead (shown in Table 1) of write requests.
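The segment and latency arithmetic described above can be summarized as follows. The sketch assumes 4 B segments with a 2-segment floor (the 8 B maximum compression) and reproduces the 20 + 16 + 20 = 56-cycle write path; it is an illustration of the stated parameters, not the simulator code.

# Worked sketch of the segment and write-latency arithmetic described above.
import math

SEGMENT_BYTES = 4
L2_RESPONSE_CYCLES = 20
COMPACTION_CYCLES = 16

def segments_needed(compressed_bytes):
    """Number of 4 B segments a compressed line occupies (2..16)."""
    return max(2, math.ceil(compressed_bytes / SEGMENT_BYTES))

def write_with_compaction_cycles():
    """Write hit/miss that changes the line size: read + compaction + write."""
    return L2_RESPONSE_CYCLES + COMPACTION_CYCLES + L2_RESPONSE_CYCLES

print(segments_needed(8), segments_needed(64))   # -> 2 16
print(write_with_compaction_cycles())            # -> 56 cycles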

6 EVALUATION & ANALYSIS

6.1 Workload Characteristics

We begin our evaluation by first analyzing the salient characteristics and cache-related behavior of the application workloads used in the study of the proposed ECM mechanism. These key characteristics are the set-associativity sensitivity, the achievable compression ratios, the distributions of cacheline sizes, and the locality attributes of each cacheline size. When we overlay a logically 16-way cache on a physically 4-way 2 MB LLC using compression, the maximum effective capacity is increased up to 8 MB. This effect is, essentially, the same as increasing the number of ways without changing the number of sets, as shown in Figure 8(a).

TABLE 1
Simulated system parameters.

Number of CMP cores:           4
Processor core type:           UltraSPARC-III+, 2 GHz
L1 caches (private):           I- and D-caches, 32 KB, 4-way, 64 B lines
L1 response latency:           3 cycles
L2 cache (shared):             2 MB, 4-way, 64 B lines, NUCA, MESI
L2 response latency:           20 cycles
L2 read hit overhead:          5 cycles for decompression
L2 writeback buffer:           8 entries
Compaction overhead:           16 cycles
DRAM memory:                   DDR3, 4 GB
Main memory response latency:  250 cycles


Fig. 8. The salient cache-related characteristics and behavior of the application workloads used in this study. (a) Miss behavior sensitivity to physical cache set-associativity (MPKI versus the number of ways and the corresponding cache size). (b) Achievable compression ratios and distributions of cacheline sizes (from 2 segments / 8 B to 16 segments / 64 B).

The y-axis of Figure 8(a) shows the number of misses per thousand instructions (MPKI), while the x-axis shows the number of ways and the physical cache size. This set-associativity sensitivity directly shows how much we can reduce the number of misses by increasing the effective capacity for each application. Even though setting the number of ways to 20 enhances performance over a 16-way setup, we set the number of logical ways to 16 in all the evaluations, because this design choice strikes the most efficient tradeoff between obtained performance and implementation overhead.

Figure 8(b) shows the achieved compression ratios and the distributions of the compressed cacheline sizes for the 10 evaluated benchmarks. A lower compression ratio indicates higher compressibility. The benchmark vips – which is the most compressible application – has over 50% of its total cachelines compressed within 2 segments (8 B). As previous compression studies have indicated, the main reason for this high compressibility is the large number of zero values. The 16-segment bars (64 B) indicate the ratio of incompressible cachelines; this accounts for 19% of vips. On the other hand, the benchmark x264 shows the lowest compressibility. Only 20% of its total cachelines are compressible to within 2 segments, while almost 40% of its cachelines are incompressible.
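The two quantities plotted in Figure 8(b) can be derived from a list of per-line compressed sizes as sketched below; the toy input is purely illustrative and does not correspond to any measured workload.

# Illustrative computation of the segment-count distribution and the
# compression ratio (compressed size / original size; lower is more compressible).
from collections import Counter

LINE_BYTES = 64
SEGMENT_BYTES = 4

def summarize(compressed_sizes):
    """compressed_sizes: per-line compressed sizes in bytes (8..64)."""
    segments = [max(2, -(-size // SEGMENT_BYTES)) for size in compressed_sizes]
    histogram = Counter(segments)                       # e.g., 2-segment vs. 16-segment lines
    ratio = sum(compressed_sizes) / (LINE_BYTES * len(compressed_sizes))
    return histogram, ratio

# Toy example: mostly zero-filled lines compress to 8 B; a few are incompressible.
hist, ratio = summarize([8, 8, 8, 12, 24, 64, 64, 40])
print(hist[2], hist[16], round(ratio, 2))   # -> 3 2 0.45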

6.2 Assessing the Size-Aware Cache Management Policies

6.2.1 The Effects of Compression and Size-Awareness

The proposed ECM mechanism is first compared to LRU without compression (LRU-u), LRU with compression (LRU-c), DRRIP [1] without compression (DRRIP-u), and DRRIP with compression (DRRIP-c), in terms of the effective capacity, cache miss-count reduction, and system performance. For ECM and DRRIP, 3-bit RRPVs are used. Based on extensive experimentation and sensitivity analyses, we set the big-size insertion point to an RRPV of 2^3 - 2 and the small-size insertion point to 2^3 - 3, because these values maximize the average percent miss-count reduction. In order to isolate the contributions of the ECM-P and ECM-ES policies, we first examine a “Baseline ECM” (BECM) scheme, which includes only two of the four main policies presented in this work; namely, ECM-I and ECM-R (but not ECM-P, or ECM-ES). This BECM setup was initially presented in [25].
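The resulting ECM-I insertion rule can be summarized as follows: big-size lines are inserted at RRPV = 2^3 - 2 = 6 and small-size lines at 2^3 - 3 = 5. The big/small size threshold is a parameter of the scheme, and the value used below is only a placeholder for illustration.

# Minimal sketch of the ECM-I insertion points used in the evaluation.
M = 3
BIG_INSERT_RRPV = (1 << M) - 2      # 6: big-size lines sit closer to eviction
SMALL_INSERT_RRPV = (1 << M) - 3    # 5: small-size lines are slightly more protected
SIZE_THRESHOLD_SEGMENTS = 8         # placeholder classification threshold

def ecm_i_insertion_rrpv(line_segments):
    """RRPV assigned to a newly inserted compressed line."""
    if line_segments > SIZE_THRESHOLD_SEGMENTS:
        return BIG_INSERT_RRPV      # big-size line
    return SMALL_INSERT_RRPV        # small-size line

print(ecm_i_insertion_rrpv(12), ecm_i_insertion_rrpv(3))   # -> 6 5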

Fig. 9. Assessing the overall performance of a “Baseline ECM” (BECM) mechanism, which only employs the ECM-I and ECM-R policies, but not ECM-P/ECM-ES: (a) effective cache capacity improvement; (b) cache miss-count reduction; (c) overall system performance. All results are normalized to LRU-u.


Figure 9(a) shows the effective cache capacity – normalized to LRU-u – assuming a 2 MB uncompressed LLC. Merely applying data compression without changing the replacement policy improves the effective capacity by 1.5 to 3 times, depending on the application. The BECM technique improves the effective capacity by an average of 2.5 times, by simply considering the size information, while the LRU-c and DRRIP-c techniques improve the capacity by an average of 2.2 and 2.1 times, respectively. The vips benchmark shows the highest capacity enhancement under all replacement policies, while facesim shows the least capacity enhancement under LRU-c and DRRIP-c. This indicates that the effective capacity enhancement exhibited by an application is not always proportional to the compression ratio, because some big-size cachelines may stay longer in the cache (if they have higher locality) than small-size cachelines.

Figure 9(b) shows results pertaining to the cache miss-count reduction, normalized to LRU-u.

The vips, swaption, and raytrace benchmarks are ranked first, second, and third, with 156.6%, 76.8%, and 67.1% miss-count reductions, respectively – as compared to LRU-u – under the BECM mechanism. These are the applications that are most sensitive to the set-associativity and physical cache size, as indicated in Figure 8(a).


Fig. 10. Investigation of the effect of adding the ECM-P policy to BECM (i.e., adding ECM-P to ECM-I+ECM-R). The RRPV promotion point of big-size cachelines is varied from 1 to 6, in order to identify the most effective promotion point under ECM-P. A 3-bit RRPV is assumed. (Miss-count reduction normalized to BECM.)

This attribute implies that their benefits mostly emanate from enhancements in the effective cache capacity. We also found that the cache miss reduction is not directly proportional to the enhancement of the effective capacity, because of the conflicts between the locality and the size information. For example, the streamcluster and ferret benchmarks show the second- and third-best improvements in effective capacity, respectively. However, their miss-count reductions are lower than those of swaption and raytrace. On average, BECM achieves a 54.1% miss-count reduction, as compared to LRU-u.

Finally, we compare the overall system performance (in terms of Instructions Per Cycle, IPC) of each configuration, as shown in Figure 9(c). The y-axis shows the IPC normalized to LRU-u. The vips benchmark shows the best IPC improvement under all investigated techniques: 73.8% under LRU-c, 89.9% under DRRIP-c, and 109.3% under BECM. This is mainly due to the large miss-count reduction under all techniques. We observe significant performance enhancement in most applications, except facesim and freqmine. We even observe some performance degradation in freqmine, regardless of the replacement policy.

These problematic benchmarks extract very little benefit from compression, so they suffer from excessive compression/decompression overhead. For such applications, an adaptive cache compression technique [10] would be a useful solution for minimizing the performance degradation. On average, the BECM mechanism enhances the system performance by 18.4%, 6.2%, and 2.5% over LRU-u, LRU-c, and DRRIP-c, respectively.

6.2.2 The Effect of Adding the ECM-P Policy to ECM-I+ECM-R

The proposed ECM architecture consists of a combination of individual techniques: ECM-I, ECM-P, and ECM-R+ECM-ES. In the previous sub-section, we analyzed the performance of BECM, which is a reduced ECM design employing only ECM-I and ECM-R [25]. The next few sub-sections will focus on further enhancing ECM's performance by incrementally applying the ECM-P and ECM-ES policies. To distinguish it from BECM, the full-fledged ECM design with all presented policies will henceforth be referred to as the “Enhanced ECM” (EECM). Before proceeding to assess EECM, we first analyze the effect of adding the ECM-P policy alone (without ECM-ES) to BECM (i.e., adding ECM-P to ECM-I+ECM-R).

Fig. 11. Performance evaluation of the “Enhanced ECM” mechanism (ECM-I+ECM-R+ECM-P+ECM-ES). In order to isolate the benefits of each individual policy, the performance results of designs with incrementally fewer policies are also presented. All results are normalized to DRRIP-c. (a) Effective cache capacity improvement; (b) cache miss-count reduction; (c) write-back delay reduction; (d) overall system performance.


Fig. 12. Analysis of the energy consumption of the proposed mechanisms. The six major components of the total energy consumption are described in Table 2. The results are normalized to the energy consumption of DRRIP-c.

The ECM-P policy can be orthogonally applied to BECM to further enhance the performance of the system, but the proper promotion points should be decided after appropriate sensitivity analysis. Toward this end, we integrate the ECM-P policy into BECM, and statically change the promotion point of big-size cachelines to investigate the impact on miss-count reduction. Note that the promotion point of small-size cachelines is fixed to RRPV=0 to ensure optimal locality awareness. By then varying the promotion point of big-size cachelines, one may assess the effect of size awareness and identify the optimal balance between locality and size awareness. Since the assumed RRPV bit-width (i.e., the width of the M-bit register) is 3, the RRPV takes values from 0 to 7, where 0 indicates a near-immediate re-reference interval and 7 indicates a distant re-reference interval. In our experiments, the RRPV promotion point of big-size cachelines is varied from 1 to 6, in order to give big-size cachelines an increasingly larger probability of eviction. Figure 10 shows the simulation results, normalized to BECM. One may observe that the ferret, swaption, facesim, and raytrace applications do not show appreciable miss-count reductions, as compared to BECM. This result indicates that over-consideration of size information may pollute locality information and worsen cache performance. Moreover, some performance degradation is also observed in several applications if we set the promotion point (RRPV in the RRIP chain) of big-size cachelines closer to 2^M - 1 (distant re-reference interval), because of a very short travel time within the RRIP chain. However, the vips, streamcluster, and x264 applications show higher miss-count reductions when we set the promotion point of the big-size cachelines to an RRPV of 3 or 4. On the contrary, the freqmine and canneal applications show higher miss-count reductions under promotion points of 1 or 2. Even though the optimal promotion points vary with the applications, setting the RRPV promotion point to 2 shows the best average miss-count reduction (1.3%), as compared to BECM. For the rest of the paper, unless otherwise stated, we only provide results with the promotion point of big-size cachelines set to RRPV=2.
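With this configuration, the ECM-P promotion rule can be sketched as follows. The sketch assumes that a hit simply moves a line to the promotion point of its size class without ever pushing it further from the RRIP head; this guard is an assumption of the sketch, which is a plausible reading of the policy rather than the exact hardware behavior.

# Minimal sketch of the ECM-P promotion rule with the chosen promotion points.
BIG_PROMOTION_RRPV = 2      # big-size lines: best average setting found above
SMALL_PROMOTION_RRPV = 0    # small-size lines: fixed for optimal locality awareness

def ecm_p_on_hit(current_rrpv, is_big):
    """Returns the RRPV of a line after a cache hit under ECM-P."""
    target = BIG_PROMOTION_RRPV if is_big else SMALL_PROMOTION_RRPV
    return min(current_rrpv, target)   # assumption: never move a line away from the head

print(ecm_p_on_hit(6, is_big=True), ecm_p_on_hit(6, is_big=False))   # -> 2 0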

6.2.3 Performance Analysis of the Enhanced ECM Mechanism (ECM-I+ECM-R+ECM-P+ECM-ES)

This sub-section assesses the combined effects of all ECM techniques. The baseline reference architecture is DRRIP with compression (DRRIP-c), which is the state-of-the-art size-agnostic cache management policy. In order to appreciate the effects of each individual policy of the ECM architecture, we also include the BECM design (ECM-I+ECM-R) [25], ECM-P+BECM (as described in the previous sub-section), and the complete, Enhanced ECM (EECM), which corresponds to ECM-P+ECM-ES+BECM.

TABLE 2
The six major components comprising the total energy consumption of the various cache management schemes.

COM:     Compression/decompression energy
DRAM_I:  DRAM leakage energy
DRAM_W:  DRAM write energy
DRAM_R:  DRAM read energy
SRAM_I:  SRAM leakage energy
SRAM_A:  SRAM active energy

Figure 11 shows the performance comparison results, normalized to DRRIP-c. In addition to the effective cache capacity enhancement, the miss-count reduction, and the IPC improvement, we also analyze the write-back delay reduction, which is the main benefit of the ECM-ES policy. As expected, the ECM-P+BECM and the EECM (ECM-P+ECM-ES+BECM) designs improve on almost all evaluation metrics, as compared to DRRIP-c and BECM.

The vips and streamcluster benchmarks show relatively higher miss-count reductions with ECM-P, although their effective cache capacity improvements are not significant. This results in overall system performance improvements of 14.7% and 6.2%, respectively, because ECM-P helps to retain high-locality big-size cachelines in the cache longer than low-locality big-size cachelines. In other words, ECM-P accelerates the eviction of low-locality big-size cachelines, by reducing the travel time of the big-size cachelines in the RRIP chain. The swaption and facesim applications do not improve in cache miss-count and system performance when compared to BECM, although they exhibit effective capacity improvements. As previously mentioned in Section 6.2.2, the over-consideration of size information may pollute the locality information and, thus, worsen the cache performance in these applications.

The ECM-P policy has very little effect on the write-back delay, except with vips, because the write-back delay is primarily proportional to the number of evicted cachelines, which ECM-P does not significantly affect. The write-back delay of vips is reduced mainly due to the large reduction in cache misses. Note that freqmine is the one application where the compression/decompression overhead exceeds the benefits of BECM, as shown in Figure 9. Although the performance of ECM-P+BECM is worse than that of DRRIP-c, the ECM-P policy enhances the performance compared to BECM. In summary, by giving a little more weight to the size information than to the locality information, the ECM-P+BECM setup enhances the average performance of the system by 3.3%, compared with DRRIP-c, and by 0.79%, compared with BECM. Although these average improvements are not very significant, the addition of ECM-P is still meaningful, because its implementation cost is almost negligible.


Fig. 13. The performance and energy consumption sensitivity of vips to the physical LLC size, assuming a logically 16-way LLC configuration for ECM and DRRIP-c. All the results are normalized to a physically and logically 4-way DRRIP-u setup with a 1 MB uncompressed LLC. (a) Overall system performance; (b) energy consumption.

The ECM-ES policy is also helpful in further enhancing the effective cache capacity, as shown in Figure 11, even though improvements to the cache miss-count are not noticeable. For example, the facesim benchmark shows the highest effective capacity enhancement, but its cache miss-count is only slightly reduced, much like other applications. In fact, the vips, streamcluster, and raytrace applications exhibit a slight increase in the cache miss-count. This is attributed to the unbalanced consideration of size and locality in these applications.

Unlike ECM-P, the main focus of ECM-ES is the reduction of the eviction penalty, by repeatedly applying demote operations before selecting a victim, based on size information.

Figures 11(c) and (d) demonstrate that the ECM-ES policy significantly reduces the eviction penalty by minimizing the number of evicted cachelines per cache miss. This translates into an improvement in overall system performance. Consequently, the EECM design – which combines all policies: ECM-I+ECM-R+ECM-P+ECM-ES – shows the highest average performance improvement, at 4.2%, as compared to DRRIP-c.

Finally, the energy consumption of the designs under evaluation is compared in Figure 12. The results are normalized to the energy consumption of DRRIP-c. The total energy consumption is decomposed into six components, as shown in Table 2. The energy consumption of the compaction operation on a miss is approximated as the energy of an extra L2 cache access (using the SRAM_A parameter). The compression/decompression operations represent about 6% of the total consumed energy in most applications. Overall, the energy reduction demonstrated by EECM is proportional to its performance improvement, because of the reduced number of main memory accesses and the reduced execution times. The EECM setup reduces the energy consumption for all applications, except freqmine, canneal, and facesim.
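As a purely illustrative aid, this decomposition can be expressed as a per-component sum, with compaction charged as extra SRAM_A energy; the fractions below are invented for the example and are not measured values.

# Toy illustration of how the six Table 2 components combine into the totals of Figure 12.
components = {"SRAM_A": 0.30, "SRAM_I": 0.20, "DRAM_R": 0.18,
              "DRAM_W": 0.12, "DRAM_I": 0.14, "COM": 0.06}   # made-up fractions of DRRIP-c

def total_energy(parts, extra_compactions_as_sram_a=0.0):
    # Compaction energy is approximated as additional SRAM active energy.
    return sum(parts.values()) + extra_compactions_as_sram_a

print(round(total_energy(components), 2))   # -> 1.0 (normalized to DRRIP-c)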

Fig. 14. The performance sensitivity of vips to the set associativity, assuming a 2 MB LLC. All the results are normalized to a physically 4-way DRRIP-u setup with a 2 MB uncompressed LLC.

In these applications, the energy consumed during compression/decompression and compaction – including the additional energy due to the extra tag area – exceeds the small energy reduction in the memory components. However, in all other applications, the significant reduction in DRAM and SRAM leakage energy consumption outweighs this energy overhead. On average, DRRIP-c reduces the energy consumption by 5.9%, as compared to DRRIP-u (a direct consequence of compression), while the proposed EECM architecture further reduces the energy consumption by 3.8%, as compared to DRRIP-c.

6.3 Performance Sensitivity to Cache Size and Logical Set Associativity

Though the proposed ECM architecture successfully enhances performance and reduces energy consumption, it is important to explore other key design parameters, in addition to the replacement policy. Hence, we investigate the performance and energy consumption sensitivity of the full EECM setup (ECM with all policies presented in this article) to the physical cache size and the logical set associativity. Figure 13 compares the IPC performance and the energy consumption when varying the physical LLC size from 1 MB to 64 MB.

The vips benchmark, which shows the highest IPC improvement, is used in this evaluation. The effective capacity of each physical cache size can be increased up to four times, because all LLCs are logically 16-way, overlaid on top of a physically 4-way cache. All results are normalized to DRRIP-u with a 1 MB LLC.

As shown in Figure 13(a), the performance of all configurations improves as the physical size of the LLC increases. The performance of DRRIP-u starts to saturate at a 64 MB LLC (not shown in the graph – which stops at 64 MB – but verified by further experiments), indicating that cache sizes beyond 64 MB do not help in further reducing the number of cache misses. On the other hand, the performance of ECM starts to saturate at LLC sizes of around 16 to 32 MB, because its effective capacity has already reached 64 MB. Beyond this point (cache sizes larger than 32 MB), the compressed LLCs show less performance improvement – or even slight performance degradation – because of the compression overhead. This implies that ECM works very well even when the physical cache capacity is not large enough to fit the data set of the application. Conversely, ECM allows the physical cache size to be reduced to one half (or a quarter) of the size required in uncompressed LLCs, without severe performance degradation.

From the perspective of energy consumption – illustrated in Figure 13(b) – increasing the physical LLC size is accompanied by both positive and negative effects.


Increasing the physical LLC size results in an increase in the energy consumption per LLC access, but a reduction in the number of main memory accesses. Thus, by increasing the physical LLC size, the total energy consumption in the cache (SRAM_A in Figure 13(b)) increases, while the DRAM energy consumption (DRAM_R and DRAM_W) decreases. The reduced energy consumption in the DRAM_R and DRAM_W components outweighs the energy consumption increase in the cache, up to a cache size of 8 MB. However, when the cache size is 16 MB or larger, the increased energy consumption in the cache dominates the energy reduction in the other components. As a result, the total energy consumption of the memory system increases as the physical LLC size exceeds 8 MB.

The performance sensitivity of ECM to the set associativity was also investigated, as shown in Figure 14. Specifically, five different logical set associativities were analyzed for the DRRIP-c and EECM configurations: 4-way, 8-way, 12-way, 16-way, and 20-way. Since a 2 MB, physically 4-way LLC is used as the baseline, the effective capacity increases up to 4 MB, 6 MB, 8 MB, and 10 MB, respectively. However, the DRRIP-u configuration indicates physical set associativity, so the number of sets is reduced when increasing the set associativity. As indicated in Figure 14, increasing the set associativity without increasing the entire cache capacity does not yield significant benefit, due to the reduced number of sets.

However, even under a fixed physical cache capacity, increasing the number of logical ways enhances performance, because of the corresponding increase in the effective cache capacity. Regardless, the performance enhancement starts to saturate from the 16-way configuration, because the need to hold more than 16 ways in a cache set is rare in most applications, even in a compressed cache. Hence, the 16-way setup is the most cost-efficient configuration.

7 CONCLUSION

The “memory wall” phenomenon is one of the biggest obstacles faced by microprocessor designers. The growing gap between processor and memory performance elevates the criticality of the cache hierarchy and magnifies the importance of the Last-Level Cache (LLC). While the LLC size continues to grow with each technology generation, application demands also increase. One approach to increasing the effective cache capacity without increasing the physical capacity is to compress the LLC.

The cacheline size in compressed caches is variable and depends on the achievable compression ratio. This article investigates the importance of cacheline size in the cache management policy. Currently, there are no replacement policies tailored to compressed LLCs, and all existing approaches are line-size-agnostic. The proposed Effective Capacity Maximizer (ECM) mechanism is a size-aware cache management policy specifically geared toward compressed LLCs. The ECM scheme aims to maximize the performance and minimize the energy consumption of the cache through a quartet of policies working in tandem: (1) ECM Insertion (ECM-I), (2) ECM Promotion (ECM-P), (3) ECM Eviction Scheduling (ECM-ES), and (4) ECM Replacement (ECM-R).

Extensive evaluation using memory traces extracted from real application workloads running in a full-system simulation environment demonstrates the efficacy and efficiency of ECM. Specifically, the proposed ECM architecture exhibits an average effective cache capacity increase of 23.9%, and an average cache miss reduction of 5.6%, as compared to the state-of-the-art DRRIP framework [1]. These enhancements translate into an average system performance improvement of 4.2% over DRRIP. Similarly, ECM is shown to lower the memory system's energy consumption by 3.8%, on average, compared to DRRIP.

REFERENCES

[1] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High performance cache replacement using re-reference interval prediction (RRIP),” in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 2010.
[2] International Technology Roadmap for Semiconductors, Semiconductor Industry Association, 2010. [Online]. Available: http://www.itrs.net/Links/2010ITRS/Home2010.htm
[3] A. R. Alameldeen and D. A. Wood, “Frequent pattern compression: A significance-based compression scheme for L2 caches,” Technical Report 1500, Computer Sciences Department, University of Wisconsin-Madison, 2004.
[4] X. Chen, L. Yang, R. Dick, L. Shang, and H. Lekatsas, “C-pack: A high-performance microprocessor cache compression algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1196–1208, Aug. 2010.
[5] P. Franaszek, J. Robinson, and J. Thomas, “Parallel compression with cooperative dictionary construction,” in Proceedings of the Conference on Data Compression (DCC '96), Washington, DC, USA: IEEE Computer Society, 1996, p. 200.
[6] J.-S. Lee, W.-K. Hong, and S.-D. Kim, “An on-chip cache compression technique to reduce decompression overhead and design complexity,” Journal of Systems Architecture, vol. 46, no. 15, Dec. 2000.
[7] L. Villa, M. Zhang, and K. Asanovic, “Dynamic zero compression for cache energy reduction,” in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 33), New York, NY, USA: ACM, 2000, pp. 214–220.
[8] J. Yang, Y. Zhang, and R. Gupta, “Frequent value compression in data caches,” in Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 33), New York, NY, USA: ACM, 2000, pp. 258–265.
[9] A.-R. Adl-Tabatabai, A. M. Ghuloum, and S. O. Kanaujia, “Compression in cache design,” in Proceedings of the 21st Annual International Conference on Supercomputing (ICS '07), 2007.
[10] A. R. Alameldeen and D. A. Wood, “Adaptive cache compression for high-performance processors,” in Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), Washington, DC, USA: IEEE Computer Society, 2004, p. 212.
[11] E. Hallnor and S. Reinhardt, “A unified compressed memory hierarchy,” in 11th International Symposium on High-Performance Computer Architecture (HPCA-11), Feb. 2005, pp. 201–212.
[12] E. G. Hallnor and S. K. Reinhardt, “A compressed memory hierarchy using an indirect index cache,” in Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI '04), 2004.
[13] S. Kim, J. Lee, J. Kim, and S. Hong, “Residue cache: A low-energy low-area L2 cache architecture via compression and partial hits,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 2011.
[14] Y. Xie and G. Loh, “Thread-aware dynamic shared cache compression in multi-core processors,” in 2011 IEEE 29th International Conference on Computer Design (ICCD), Oct. 2011, pp. 135–141.
[15] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and J. Emer, “Adaptive insertion policies for managing shared caches,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), New York, NY, USA: ACM, 2008, pp. 208–219.
[16] H. Liu, M. Ferdman, J. Huh, and D. Burger, “Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency,” in Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 41), Washington, DC, USA: IEEE Computer Society, 2008, pp. 222–233.
[17] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA '07), New York, NY, USA: ACM, 2007, pp. 381–391.
[18] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “SHiP: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 2011.
[19] R. Bagrodia et al., “Parsec: A parallel simulation environment for complex systems,” Computer, vol. 31, pp. 77–85, Oct. 1998.
[20] G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “Exploiting compressed block size as an indicator of future reuse,” SAFARI Technical Report 2013-003, Computer Architecture Lab, Carnegie Mellon University, 2013.
[21] P. S. Magnusson et al., “Simics: A full system simulation platform,” Computer, vol. 35, pp. 50–58, Feb. 2002.
[22] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, “Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset,” SIGARCH Computer Architecture News, vol. 33, 2005.
[23] CACTI 6.5. [Online]. Available: http://www.hpl.hp.com/research/cacti/
[24] Samsung Electronics, “DDR2 registered SDRAM module, M393T5160QZA, M393T5750GZ3,” Datasheet.
[25] S. Baek, H. G. Lee, C. Nicopoulos, J. Lee, and J. Kim, “ECM: Effective capacity maximizer for high-performance compressed caching,” in 19th International Symposium on High-Performance Computer Architecture (HPCA-19), Feb. 2013.