A survey of architectural techniques for improving cache power efficiency


Sustainable Computing: Informatics and Systems 4 (2014) 33–43


Sparsh Mittal ∗

Future Technologies Group, Oak Ridge National Laboratory (ORNL), Oak Ridge, TN, United States

a r t i c l e   i n f o

Article history:
Received 13 November 2012
Received in revised form 17 July 2013
Accepted 6 November 2013

a b s t r a c t

Modern processors are using increasingly larger sized on-chip caches. Also, with each CMOS technology generation, there has been a significant increase in their leakage energy consumption. For this reason, cache power management has become a crucial research issue in modern processor design. To address this challenge and also meet the goals of sustainable computing, researchers have proposed several techniques for improving the energy efficiency of cache architectures. This paper surveys recent architectural techniques for improving cache power efficiency and also presents a classification of these techniques based on their characteristics. For providing an application perspective, this paper also reviews several real-world processor chips that employ cache energy saving techniques. The aim of this survey is to enable engineers and researchers to get insights into the techniques for improving cache power efficiency and motivate them to invent novel solutions for enabling low-power operation of caches.

Keywords: Cache energy saving techniques; Dynamic energy; Leakage energy; Power management; Energy efficiency; Green computing

1. Introduction

As we are entering an era of green computing, the design of energy efficient IT solutions has become a topic of paramount importance [1]. Recently, the primary objective in chip design has been shifting from achieving the highest peak performance to achieving the highest performance-energy efficiency. Achieving energy efficiency is important in the design of the full range of processors, from battery-driven portable devices and desktop or server processors to supercomputers.

To meet the dual and often conflicting goals of achieving the best possible performance and the best energy efficiency, several researchers have proposed architectural techniques for different components of the processor, such as the processor core, caches and DRAM (dynamic random access memory). For several reasons, managing the energy consumption of caches is a crucial issue in modern processor design. With each CMOS (complementary metal oxide semiconductor) technology generation, there is a significant increase in the leakage energy consumption [2,3]. According to the estimates of the International Technology Roadmap for Semiconductors (ITRS), with technology scaling, leakage power consumption will become a major industry crisis, threatening the survival of CMOS technology itself [4]. Further, the number of processor cores on a single chip has greatly increased over the years and future chips are expected to have a much larger number of cores [5].

∗ Tel.: +1 865 574 8531. E-mail address: [email protected]

2210-5379/$ – see front matter © 2013 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.suscom.2013.11.001


Finally, to bridge the gap between the speed of the processor and main memory, modern processors are using caches of increasingly larger sizes. For example, modern desktop processors generally have 8 MB last-level caches [6], while server systems are designed with 24–32 MB last-level caches [7,8]. On-chip caches consume 16% of the total power in the Alpha 21264 and 30% of the total power in the StrongARM processor core [9]. In both Niagara and Niagara-2 processors, the L2 cache consumes nearly 24% of the total power consumption [10]. Thus, the power consumption of caches is becoming a large fraction of processor power consumption. To address the challenges posed by these design trends, a substantial amount of research has been directed towards achieving energy efficiency in caches. The focus of this paper is to review some of these techniques.

The rest of the paper is organized as follows. Section 2 presents a brief background on modeling of CMOS energy consumption and discusses the essential design principles which guide cache energy saving approaches. Sections 3 and 4 discuss the approaches which have been proposed for saving dynamic and leakage energy, respectively. In these sections, we first discuss the landscape of the techniques proposed and then discuss a few of them in more detail. We highlight the basis of similarities and differences between the techniques. Since leakage energy is becoming an increasing fraction of cache power consumption [2,3,11], we focus more on leakage energy saving techniques than on dynamic energy saving techniques. Section 5 discusses the approaches which are used for saving both dynamic and leakage energy. To show real-life implementation of the research ideas, Section 6 discusses some of the commercial chip designs which employ cache energy saving techniques. Finally, Section 7 concludes the paper.

Table 1
Classification of techniques proposed for different caches.

Caches/criterion | Energy-saving techniques (ESTs)
For first-level cache (L1) | [23,24,27,28,34,39,71,148]
For last-level caches (L2 or L3) | [14–16,33,48,70,73,75,81,83,84,96]
For instruction caches | [21,35,42,47,59,62,66,68,69,74,85,119,149,150]
For data caches | [9,24,28,38,40,58,63,148]
ESTs utilizing hardware support | [26,31,39,58,60,64,65,71]
ESTs utilizing software support | [19,21,34,41,75,83,84,90,96]
ESTs utilizing compiler | [35,36,52,63,85]


As it is not possible to cover all the techniques in full detail in a review of this length, we take the following approach to restrict the scope of the paper. Although the techniques designed to improve performance are also likely to save energy, in this survey we only consider those techniques which aim to optimize energy efficiency and have been shown to improve energy efficiency. Further, cache energy saving can also be achieved using circuit-level techniques (e.g., low-leakage devices); however, in this paper we mainly focus on architecture-level techniques which allow runtime cache power management. Lastly, since different techniques have been evaluated using different simulation infrastructures and workloads, we do not include their quantitative improvement results. Rather, we focus on their key design principles, which can provide valuable insights.

2. Background and related work

The power consumption of CMOS circuits is mainly classified in two parts, namely dynamic power (also called active power) and leakage power (also called static power). In what follows, we present the modeling equations for both dynamic and leakage power in their simplified forms, which helps to gain insights into how energy saving techniques work. For a more detailed modeling, we refer the reader to [12,13].

Dynamic power (P_dynamic) is dissipated whenever transistors switch to change the voltage in a particular node, while leakage power (P_leakage) is dissipated due to leakage currents that flow even when the device is inactive. Mathematically, they are given by,

P_dynamic = α × C_eff × V_DD² × F    (1)

P_leakage = V_DD × I_leak × N × k_design    (2)

Here α denotes the activity factor, V_DD the supply voltage, C_eff the effective capacitance and F the operating frequency. Further, N is the number of transistors, k_design is a design-dependent parameter and I_leak is the leakage current, which is a technology-dependent parameter.

From Eq. (1) we infer that, for a given CMOS technology generation, dynamic power consumption can be reduced by adjusting the voltage and frequency of operation or by reducing the activity factor (e.g., by reducing the number of cache accesses or the number of bits accessed in each cache access). Similarly, from Eq. (2), it is clear that, for a given CMOS technology generation, the opportunity for saving leakage energy lies in redesigning the circuit to use low-power cells, reducing the total number of transistors or putting some parts of caches into a low (or zero) leakage mode. Based on these essential principles, several architectural techniques have been proposed, which we discuss in the next sections.
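As a concrete illustration of these levers, the sketch below evaluates Eq. (1) for a hypothetical cache. All parameter values are invented round numbers chosen only to expose the trends (halving the activity factor halves P_dynamic, while scaling V_DD from 1.0 V to 0.8 V reduces it quadratically); they are not measurements from any real process.

```python
# Illustrative evaluation of Eqs. (1) and (2); every parameter value here is
# an assumed round number, not real process data.

def dynamic_power(alpha, c_eff, v_dd, freq):
    """Eq. (1): P_dynamic = alpha * C_eff * V_DD^2 * F."""
    return alpha * c_eff * v_dd ** 2 * freq

def leakage_power(v_dd, i_leak, n_transistors, k_design):
    """Eq. (2): P_leakage = V_DD * I_leak * N * k_design."""
    return v_dd * i_leak * n_transistors * k_design

# Baseline: hypothetical cache at 1.0 V, 2 GHz, activity factor 0.10.
base = dynamic_power(alpha=0.10, c_eff=1e-9, v_dd=1.0, freq=2e9)

# Halving the activity factor (e.g., fewer cache accesses) halves P_dynamic.
half_activity = dynamic_power(alpha=0.05, c_eff=1e-9, v_dd=1.0, freq=2e9)

# Lowering V_DD from 1.0 V to 0.8 V cuts P_dynamic quadratically (factor 0.64).
low_vdd = dynamic_power(alpha=0.10, c_eff=1e-9, v_dd=0.8, freq=2e9)

print(base, half_activity, low_vdd)
```

The same pattern applies to Eq. (2): turning off a fraction of cache blocks effectively reduces N, shrinking P_leakage proportionally.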

Modern processor cores have a multi-level cache hierarchy with L1, L2, L3 caches, etc. Also, typically at level one, data and instruction caches are designed as separate caches, while at lower levels (i.e., at levels 2 and 3), instruction and data caches are unified. These caches have different properties, and different techniques utilize these properties to save cache energy. Table 1 summarizes the techniques which are proposed for L1 or L2/L3, and for instruction or data caches. More detailed discussion of some of these techniques is presented in the following sections.

First-level caches (FLCs) are designed to minimize access latency, while last-level caches (LLCs) are designed to minimize cache miss-rate and the number of off-chip accesses. Accordingly, FLCs are smaller (e.g., 32 KB or 16 KB), have smaller associativity and employ parallel lookup of data and tag arrays [14]. In contrast, LLCs are much larger (e.g., 2 MB, 4 MB, etc.), have higher associativity and employ serial (phased) lookup of data and tag arrays. Thus, as an example, an energy saving technique which increases the cache access latency is more suitable for LLCs than for FLCs. Also, due to their relatively smaller sizes and large number of accesses, FLCs spend a larger fraction of their energy in the form of dynamic energy, while LLCs spend a larger fraction of their energy in the form of leakage energy [15,16]. As for instruction/data caches, the instruction access stream exhibits strong spatial and temporal locality and hence, the instruction cache is very sensitive to an increase in access latency. The working set and reuse characteristics of the instruction cache are different from those of the data cache. Also, since the instruction cache does not hold dirty data, reconfiguring it does not lead to write-back of dirty data; thus, reconfiguration of the instruction cache can be more easily implemented than that of the data cache.

Table 1 also classifies the techniques based on whether they need hardware, software and/or compiler support. While compiler-based approaches incur no or minimal hardware overhead, compiler analysis may not be possible in all situations and the compiler has only limited information. Software-based approaches can leverage the software to make more complex decisions and consider the impact of energy saving techniques on components of the processor other than the cache; however, software-only approaches generally cannot exercise the opportunity of frequent reconfigurations and also incur larger implementation overhead than hardware-only approaches. Hardware-based approaches can utilize simple yet low-overhead algorithms, and can also exercise the opportunity of fine-grained and frequent reconfigurations. However, these approaches cannot easily take other components of the processor into account.

Kaxiras and Martonosi [17] survey some of the architectural techniques proposed for saving energy in processors and memory hierarchies. This paper differs from their work in that we review several recent developments which have been made in the fast evolving field of design of energy efficient architectural techniques. Also, we exclusively focus on the techniques aimed to save cache energy, to provide more in-depth insights. Finally, to show the practical application of the research ideas, we also discuss the examples of many commercial chips which use cache energy saving designs.

3. Dynamic energy saving approaches

3.1. Overview

Recently, several techniques have been proposed for saving dynamic energy. Before discussing them in detail, it is helpful to see their underlying similarities and differences by classifying them across several dimensions. Some techniques save dynamic energy by reducing the number of accesses to a particular level of the cache hierarchy by using additional memory structures. These structures are used either for data storage [18–22], for prediction of the cache access result [23–27] or for pre-determination of the cache access result [28–33].

Table 2
Classification of dynamic energy saving techniques.

Criterion | Energy-saving techniques (ESTs)
ESTs using extra memory storage | For data storage [18–22]; For prediction of cache access result [23–27]; For pre-determination of cache access result [28–33]
Reducing number of ways consulted in each access | Using software [34]; Compiler [35,36]; Hardware [26,29,31–33,37–39]
Reducing switching activity | Sequential cache-way access [23,26,41,42]; Multi-step tag-bit matching [43]; Reducing active tag bits or those actually compared [44–48]; Accessing frequent (hot) data with lower energy [14,40]
ESTs for multicores or multiprocessors | [25,36,50]

Some techniques reduce the number of cache ways accessed in each cache access by using either software information [34], compiler information [35,36] or hardware-based approaches [26,29,31–33,37–39]. A few techniques provision accessing frequent (hot) data with lower energy to reduce the average dynamic energy of a cache access [14,40].

Some techniques trade off access time for gaining energy efficiency by performing the various tasks required for cache accesses in a sequential manner instead of simultaneously (i.e., in a parallel manner). If a cache hit/miss decision has already been taken, further tasks can be avoided for saving dynamic energy. For example, a few techniques access cache ways sequentially [23,26,41,42]; some other techniques perform matching of tag-bits in a multi-step manner [43], while some techniques reduce the number of tag bits which are active or are required for comparison [44–48]. Techniques have also been proposed which reduce the data transferred in each cache access (e.g., [49]). Also, while the above mentioned techniques can also be extended for saving dynamic energy in multiprocessor systems, some techniques have been especially designed to address the issues related to saving energy in multiprocessor systems [25,36,50]. Table 2 provides an overview of dynamic energy saving techniques.

3.2. Discussion

Kin et al. [18] propose a small filter cache which is placed in front of the conventional L1 cache. By trying to serve most of the accesses from the data present in the filter cache, their technique reduces the number of L1 accesses and thus saves dynamic energy. The tradeoff involved in the use of the filter cache is that for achieving a reasonably high hit-rate in the filter cache, the size of the filter cache needs to be large, which, in turn, increases its access time and energy consumption.
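The access path just described can be sketched as below. The FIFO replacement policy, the 4-line capacity and the relative energy costs are illustrative assumptions, not parameters from [18]; the point is only that filter-cache hits avoid the more expensive L1 probe.

```python
# Minimal filter-cache sketch: a tiny cache in front of L1 absorbs most
# accesses so the larger (more energy-hungry) L1 is consulted less often.
from collections import deque

class FilterCache:
    def __init__(self, n_lines=4):
        self.lines = deque(maxlen=n_lines)   # assumed FIFO replacement

    def access(self, tag):
        if tag in self.lines:
            return True                      # hit: L1 access avoided
        self.lines.append(tag)               # miss: line filled from L1
        return False

E_FILTER, E_L1 = 1, 10                       # assumed relative access energies
fc, energy = FilterCache(), 0
trace = [0, 1, 0, 1, 0, 2, 0, 1]             # hypothetical address trace
for addr in trace:
    energy += E_FILTER                       # filter cache is always probed
    if not fc.access(addr):
        energy += E_L1                       # only filter misses pay the L1 cost
print(energy)                                # compare against len(trace) * E_L1
```

With this trace, 5 of the 8 accesses hit in the filter cache, so the total energy is well below the 80 units an L1-only design would spend.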

Dropsho et al. [41] discuss an 'accounting cache' architecture, which works on the temporal locality property of caches. It uses the LRU replacement policy, which places the most frequently used blocks near the MRU way-positions; thus, most accesses are likely to hit in those ways. Using this idea, the accounting cache logically divides the cache ways into two parts, named primary and secondary. On any cache access, initially only the primary ways are accessed; the secondary ways are accessed only if there is a miss in the primary ways. Thus, due to the reduction in the average number of way-comparisons, the accounting cache saves dynamic energy. Udipi et al. [14] propose 'non-uniform power access' in the cache, where certain ways of the cache are accessible at low energy using low-swing wires. These ways are used as MRU ways. Thus, their technique saves energy by making the "common case" (i.e., hits to MRU ways) energy efficient.
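The primary/secondary way split can be sketched as follows. The 2-way primary partition and the probe counts (standing in for access energy) are illustrative assumptions rather than the paper's exact configuration.

```python
# Accounting-cache sketch: probe the ways holding the most-recently-used
# blocks first; touch the remaining ways only on a primary miss.

def accounting_access(set_ways, tag, n_primary=2):
    """Return (hit, ways_probed) for one access to an MRU-ordered set.

    set_ways is ordered MRU-first, so the primary partition holds the
    hottest blocks and absorbs most hits cheaply.
    """
    primary, secondary = set_ways[:n_primary], set_ways[n_primary:]
    if tag in primary:
        return True, len(primary)            # common case: few ways probed
    if tag in secondary:
        return True, len(set_ways)           # fall back to remaining ways
    return False, len(set_ways)              # miss: all ways were examined

ways = ["A", "B", "C", "D"]                  # MRU ... LRU
assert accounting_access(ways, "A") == (True, 2)   # MRU hit costs 2 probes
assert accounting_access(ways, "D") == (True, 4)   # LRU hit costs all 4
assert accounting_access(ways, "X") == (False, 4)
```

Because temporal locality concentrates hits in the MRU positions, the average number of way-comparisons per access stays close to the primary size.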

Yang and Gupta [40] discuss a 'frequent value' data cache design, which divides the data cache into two arrays. For accesses made to frequent (i.e., hot) cache lines, only the first data array is accessed, while for accesses made to infrequent (i.e., cold) cache lines, both data arrays are accessed. Further, frequent cache lines are stored in encoded form and for accesses made to them, the number of bit comparisons is reduced, leading to savings in dynamic energy.

Jones et al. [35] propose a technique for saving energy in instruction caches. Their technique works by using the compiler to place the most frequently executed instructions at the start of the binary. At runtime, these instructions are explicitly mapped to specific ways within the cache. When these way-placed instructions are fetched, only the specific ways are accessed. This leads to savings in cache dynamic energy.

Powell et al. [23] use the technique of predicting the cache way which is likely to contain the data. On a cache access, first only a single way is accessed. Thus, on a correct prediction, the cache behaves just like a direct-mapped cache and the dynamic energy of the cache access is reduced. However, on a misprediction, all the cache ways have to be accessed. Thus, a misprediction leads to an increase in cache access time and energy. Also, use of their technique leads to non-uniform cache hit latency for right and wrong predictions of the cache access result. To address this, way-selection based techniques have been proposed (see Section 5).
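A minimal sketch of this way-prediction idea follows. Predicting the MRU way is one common heuristic (an assumption here, not necessarily the predictor of [23]), and the probe count stands in for access energy.

```python
# Way-prediction sketch: probe the predicted way first; on a wrong
# prediction all ways must be probed, increasing latency and energy.

def predicted_access(set_ways, predicted_way, tag):
    """Return the number of ways probed to service a hit."""
    if set_ways[predicted_way] == tag:
        return 1                              # correct prediction: one way, like
                                              # a direct-mapped cache
    return len(set_ways)                      # misprediction: probe all ways

ways = ["P", "Q", "R", "S"]                   # MRU-first; predict way 0
assert predicted_access(ways, predicted_way=0, tag="P") == 1
assert predicted_access(ways, predicted_way=0, tag="R") == 4
```

The non-uniform probe count (1 vs. 4) is exactly the non-uniform hit latency the paragraph mentions.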

Zhu and Zhang [26] propose a technique which combines way-prediction and phased access mechanisms to reduce the dynamic energy of caches. Their technique uses the way-prediction mode to handle a cache hit and the phased mode (i.e., accessing the tag array first and, based on the result, accessing the data array) to handle a cache miss. Their technique uses simple predictors to predict the result of a cache access. When the access is predicted to hit, the way-prediction scheme determines the desired way and probes that way only. When the access is predicted to miss, the phased-access scheme accesses all the tags of the cache-set first and then only the appropriate cache way is accessed. Thus, their technique saves cache energy by reducing the number of accesses to data arrays.

Tsai and Chen [20] propose a technique for improving the energy efficiency of embedded processors by using a memory structure called a "Trace Reuse cache". The Trace Reuse cache is used at the same level of the memory hierarchy as a conventional instruction cache. It works by reusing the retired instructions from the pipeline back-end of a processor to efficiently deliver instructions in the form of traces. This enables the processor to sustain a higher instruction rate, which improves both performance and energy efficiency.

Ghosh et al. [29] propose a technique named 'Way Guard' to save dynamic energy in caches. This technique uses a segmented counting Bloom filter [51] with each cache way. By accessing this structure before a cache access, the cache controller can determine whether the requested cacheline is not present in a particular way. To save energy, the lookup of those cache ways which do not contain the requested data can be avoided.
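A per-way filter of this kind can be sketched as below. The filter length, the single hash function and the addresses are illustrative assumptions; the actual Way Guard design uses a segmented counting Bloom filter with more elaborate hashing, but the "definitely not here" guarantee works the same way.

```python
# Way Guard sketch: a small counting Bloom filter per way answers
# "definitely not present" before the way's arrays are read, so those
# ways can be skipped entirely.

class WayFilter:
    def __init__(self, size=64):
        self.counts = [0] * size

    def _idx(self, addr):
        return hash(addr) % len(self.counts)

    def insert(self, addr):
        self.counts[self._idx(addr)] += 1

    def evict(self, addr):
        self.counts[self._idx(addr)] -= 1     # counting filters allow deletion

    def may_contain(self, addr):
        # False is a guarantee of absence; True only means "maybe present"
        # (Bloom filters admit false positives, never false negatives).
        return self.counts[self._idx(addr)] > 0

filters = [WayFilter() for _ in range(4)]      # one filter per cache way
filters[2].insert("0xBEEF")                    # the line resides in way 2

# Only ways whose filter says "maybe" are actually looked up.
ways_to_probe = [w for w, f in enumerate(filters) if f.may_contain("0xBEEF")]
print(ways_to_probe)
```

Here only one of the four ways is probed; the other three lookups, and their dynamic energy, are avoided.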

Park et al. [43] discuss techniques for reducing tag comparisons by making an early estimation of cache hit or miss and using the result for skipping tag comparisons, if possible. Their method tracks the hotness or coldness (depending on the frequency of accesses made) of a cache line, and for cold cache lines, a partial tag comparison is performed before the full tag comparison to explore the possibility of early termination of the tag comparison to save energy. Additionally, their method first compares the tags of hot lines, and only in case of a miss does it compare the tags of cold lines. Since, by the temporal locality property of caches, most of the cache hits are likely to occur in hot blocks, this method saves energy by reducing tag comparisons.

Kwak and Jeon [44] propose a technique to reduce the power consumed in tag-accesses. Their technique works on the observation that since program applications typically exhibit high memory access locality, most of the tag bits of successive memory addresses are expected to be the same, except for a few differences in the LSB-side bits. Thus, by storing the MSB-side bits in compressed form, the actual number of bits required for comparison can be reduced. Based on this, their technique logically divides the cache tag into two parts, namely lower tag bits and higher tag bits. For applications with low memory access locality, the number of bits which are taken as lower tag bits (i.e., LSB-side bits) needs to be larger, and vice versa. Further, the higher tag bits are stored in compressed form. On any cache access, when tag-matching is performed, all the lower tag bits and only the compressed higher tag bits are compared, which leads to saving of cache energy.
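The lower/higher tag split can be sketched as follows. The bit widths and the code-table mechanism are illustrative assumptions standing in for the paper's compression scheme; the sketch only shows why nearby addresses end up comparing a short code instead of the full MSB-side tag.

```python
# Tag-splitting sketch: addresses in the same region share their MSB-side
# tag bits, so the high part can be matched via a short compressed code
# instead of bit-by-bit.

code_table = {}                               # assumed dictionary of high-tag codes

def split_tag(tag, low_bits=8):
    """Split a tag into (compressed high code, raw low bits)."""
    high, low = tag >> low_bits, tag & ((1 << low_bits) - 1)
    code = code_table.setdefault(high, len(code_table))  # e.g., a 2-bit code
    return code, low

# Two addresses in the same region: identical high tag, hence identical
# short code, so the comparison reduces to 8 low bits plus a tiny code match.
code_a, low_a = split_tag(0xAB12)
code_b, low_b = split_tag(0xAB34)
assert code_a == code_b and low_a != low_b
```

A full tag match here costs 8 raw bit comparisons plus one short code comparison, instead of comparing all 16 tag bits.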

Guo et al. [52] present techniques to reduce the energy overhead of prefetching techniques. One of their techniques uses compiler information to identify memory addresses which are useful for prefetching and then prefetches only those memory addresses. Another technique proposed by them uses a small prefetch filter buffer (PFB) to reduce the overhead of L1 tag look-up related to prefetching. The PFB stores the most recently prefetched cache tags, and a prefetching address is checked against the PFB. If the address is found in the PFB, prefetching is avoided, which leads to saving of energy.

Fang et al. [34] propose a method to save dynamic energy by utilizing software semantics in cache design. For saving instruction cache energy, their technique works on the observation that a user-mode instruction fetch cannot hit in a cache line that contains kernel-mode code. Based on this, on a user-mode instruction fetch, only the cache-ways containing user-mode code are accessed. For saving data cache energy, their technique works on the observation that an access to stack data cannot hit in a way containing heap data and vice versa. Hence, they store an extra bit with each tag to identify whether the data belongs to stack or heap. Checking this bit at the time of access helps in early identification of cross-checks between stack and heap. Using this information, checking those ways can be avoided which are sure to lead to a cache miss.

Recently, researchers have proposed 3D-stacked DRAM caches, where 3D die stacking is used to interconnect a processor die with a stack of DRAM dies using through-silicon vias (TSVs) [53]. 3D die stacking technology promises lower interconnect power consumption, smaller chip footprint and increased on-chip bandwidth, and for these reasons, this technology is already finding commercial adoption [54]. However, 3D die stacking technology also presents significant power and thermal challenges [55]. To address these, several architecture-level techniques have been proposed.

Jevdjic et al. [56] present a 3D stacked DRAM cache design which aims to achieve high energy efficiency and performance by intelligent cache management. In DRAM caches, using a small (e.g., 64 B) block size leads to optimized use of cache capacity and bandwidth, but high lookup latency. On the other hand, using a large granularity (e.g., 4 KB pages) leads to fast lookup and reduced tag overhead at the cost of poor bandwidth and cache capacity utilization. Jevdjic et al. propose a cache design which allocates data at the granularity of pages, but fetches only those blocks that will be accessed during the page's residency in the cache. This avoids bringing unnecessary data into the cache, which improves cache and bandwidth utilization. The prediction of useful blocks is made by identifying spatial correlation. The experimental results show that in terms of energy efficiency, their technique outperforms a conventional (i.e., without die-stacking) cache and also block-based and page-based 3D stacked DRAM cache designs.

Sun et al. [57] propose a heterogeneous 3D DRAM architecture to implement both the L2 cache and main memory within the 3D stacked DRAM dies. Because of the larger density of DRAM, larger sized DRAM caches can be architected in the same area compared to SRAM caches. Sun et al. study the use of multiple (viz. 2, 4 and 8) layers of stacked DRAM and show that their proposed 3D DRAM cache design offers better energy efficiency than a 2D SRAM design. Moreover, use of a larger number of layers helps in reducing the access latency and energy consumption.

4. Leakage energy saving approaches

4.1. Overview

As explained before, leakage energy saving approaches work by turning off a part of the cache to reduce its leakage energy consumption. Based on the data retentiveness of turned-off blocks, the leakage energy saving techniques are classified into two broad types, namely state-preserving and state-destroying techniques. The state-preserving techniques turn off a block while preserving its state (e.g., [58–61]). This means that when the block is reactivated, it does not need to be fetched from the next level of memory. State-destroying techniques (e.g., [62]) do not preserve the state of the turned-off block, but generally save more energy in the low-power states than the state-preserving techniques. Several microarchitecture-level techniques utilize state-preserving leakage control (e.g., [63–70]), while others employ state-destroying leakage control (e.g., [37,71–84]). Some researchers have proposed techniques which work with either or both of the state-preserving and state-destroying leakage control mechanisms [16,85–87]. Li et al. [88] compare the effectiveness of state-preserving and state-destroying techniques. They conclude that when the cost of fetching a missed block is high, state-destroying techniques incur a large penalty and thus, state-preserving techniques show superior performance. However, when the cost of fetching a missed block is not high, a state-destroying technique can be superior to the state-preserving technique, since a state-destroying technique completely turns off the block, thus saving more energy.

The energy saving techniques turn off cache at the granularity (unit) of a certain cache space, such as a single way or a single block at a time. Based on this granularity, leakage energy saving techniques can be classified as way-level [64,65,81,89–94], set-level (or bank-level) [74,93], hybrid (set and way) level [75,76,95], cache block-level [61,63,66–68,71–73,79,85,87], cache sub-block level [82], cache color level [83,84,96] or cache sub-array level [77], etc.

To demonstrate the typical values of the different cache parameters, we take the example of an 8-way set-associative cache of 2 MB size with 64 B block size and 8-byte sub-blocks. For computing the number of cache colors, we assume that the system page size is 4 KB. Then, the number of cache blocks is 32,768 and the number of sub-blocks is 262,144. The number of cache ways is 8 and the number of cache colors is 64. The number of sets is 4096; however, it is noteworthy that the selective-sets approach allocates cache only at the granularity of a power-of-two number of sets, such as 4096, 2048 or 1024 sets, etc. The cache sub-array level allocation approach in [77] divides a 2 MB cache into 6 heterogeneous parts (called sub-caches), which have the sizes of 1 MB, 512 KB, 256 KB, 128 KB, 64 KB and 64 KB. Thus, different techniques work at different granularities.
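The counts in this example follow directly from the cache geometry; the short sketch below redoes the arithmetic. The formulas for sets and colors are the standard ones (colors here count the pages that fit in one way-sized slice of the cache), stated as assumptions:

```python
# Worked arithmetic for the 2 MB, 8-way, 64 B-block, 8 B-sub-block,
# 4 KB-page example from the text.
cache_size, ways, block, sub_block, page = 2 * 2**20, 8, 64, 8, 4 * 2**10

blocks     = cache_size // block              # granularity of block-level ESTs
sub_blocks = cache_size // sub_block          # the finest granularity mentioned
sets       = cache_size // (block * ways)     # what selective-sets partitions
colors     = cache_size // (ways * page)      # pages per way-sized cache slice

print(blocks, sub_blocks, sets, colors)       # 32768 262144 4096 64
```

These match the values in the paragraph above: 32,768 blocks, 262,144 sub-blocks, 4096 sets and 64 colors.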

Reconfiguration at each of these levels of granularity has its own advantages and disadvantages. Unlike the selective-sets and cache-coloring approaches, the selective-way approach does not require a change in the set-decoding on cache reconfiguration, which leads to smaller reconfiguration overhead and easier implementation. However, the selective-way approach harms the associativity of the cache; for example, turning off all but one way of a last-level cache turns it into a direct-mapped cache, which leads to a very high miss-rate and many off-chip accesses.

Table 3
Classification of leakage energy saving techniques (reconfig. = reconfiguration).

Criterion | Energy-saving techniques (ESTs)
Circuit-level ESTs | State-preserving [58–61]; State-destroying [62]
Micro-architectural ESTs | State-preserving [63–70]; State-destroying [37,71–84]; Either or both [16,85–87]
Reconfig. granularity | Way-level [64,65,89–94]; Set-level (or bank-level) [74,93]; Hybrid (set and way) level [75,76,95]; Cache block-level [61,63,66–68,71–73,79,85,87]; Cache sub-block level [82]; Cache color level [83,84,96]; Cache sub-array level [77]
Reconfig. frequency | Static [89,92,94]; Dynamic (all in the next row)
Dynamic reconfig. | Fixed large interval [64,66,71,74–77,79,81,83,84,93,95]; Variable interval [67,72,73]; Continuous reconfig. [16,61,63–65,68,82,85,86]
Basic property on which ESTs work | Inclusion property of cache hierarchies [16]; Temporal locality [58,63,64,66–68,71–73,79,82,85]; Varying working set size [74–76,83,84,95–97]
What is turned off | Only data array (and not tag array) [72,73,98]; Both data and tag arrays (almost all others)
Profiling: offline or online | Offline (or compiler analysis) [63,74,76,85,94,95,99–102]; Online (almost all others)
Thermal-aware ESTs | [87,93,94,103,104]
ESTs for multi-cores/multi-processors | [15,37,80,86,90,96,98,101,102,105–115]
Integration with other approaches | DVFS [91,111,116–119]; Data compression [78,108,120,121]; Prefetching [122]
ESTs in different application contexts | QoS systems [83,96]; Real-time systems [100,101,123]; Embedded multitasking systems [99,101,124,125]


Also, the selective-way approach provides limited granularity, which is at most equal to the number of ways. Achieving high granularity with the selective-ways approach requires the use of highly associative caches, which also have high access time and energy. The selective-sets approach can potentially provide large granularity; however, in practice, it is observed that reducing the cache size below 1/8 or 1/16 significantly increases the miss-rate [74,75]. Cache coloring can provide better granularity and smaller reconfiguration overhead than the selective-sets approach (by using a mapping-table [83,84]); however, it also has higher implementation overhead. The hybrid (selective-sets and selective-ways) approach aims to provide higher granularity than either of the two approaches and combine their benefits; however, its implementation overhead is higher than that of either the selective-sets or selective-ways approach alone. Cache block-level reconfiguration provides much higher granularity than any of the above mentioned approaches and does not change the set-decoding on reconfiguration. This approach typically makes the decision to turn off each block locally, based on the access/miss characteristics of each block. Cache sub-block level reconfiguration provides the highest granularity; however, it incurs high implementation overhead. Also note that with increasing granularity, the reconfiguration decision logic generally becomes increasingly complex. Moreover, increased granularity does not necessarily provide higher energy savings if the application does not benefit from it.

From the point of view of the time-interval at which reconfiguration is done, the techniques can be classified1 based on whether they use a fixed (static) configuration throughout the execution (e.g., [89,92,94]) or use dynamic reconfiguration (i.e., the configuration is dynamically changed during the execution). Among techniques using dynamic reconfiguration, some techniques switch cache blocks at the boundary of a fixed, large interval size [64,66,71,74–77,79,81,83,84,93,95]; some techniques use a variable interval size (time length) [67,72,73]; and some techniques switch cache blocks throughout the execution (for example, before or after cache accesses) [16,61,63–65,68,82,85,86].

To achieve cache energy saving, different techniques utilize different properties of the caches. Some techniques save energy by exploiting the inclusion property of cache hierarchies [16], while some other techniques exploit the generational nature or temporal locality property of caches [58,63,64,66–68,71–73,79,82,85]. A few other techniques dynamically reconfigure the cache based on the working set size² of applications and turn off the rest of the cache to save energy [74–76,83,84,95–97]. In multicore systems, this approach extends to partitioning the cache between different applications and turning off the rest of the cache to save cache power [90].

Leakage energy saving techniques can also be classified depending on whether they turn off only the data array or both the data and tag arrays. For example, a few techniques turn off only the data arrays of unused sections and keep the tag arrays turned on to guide their algorithms [72,73,98], while most other techniques afford to turn off both the data and tag arrays of unused sections. For guiding their algorithms, some techniques require offline profiling or compiler analysis for their function (e.g., [63,74,76,85,94,95,99–102]), while most others use only runtime information for guiding their

algorithms.

Since leakage energy varies exponentially with temperature, an increase in chip temperature increases the leakage energy dissipation in caches, which, in turn, further increases the chip temperature. To take chip temperature into account while modeling

¹ Some techniques have multiple variants with different characteristics and hence, they are classified in multiple groups.
² The working set of an application is the number of unique cache lines accessed during a given execution interval.


and minimizing leakage energy, several techniques have been proposed [87,93,94,103,104]. Such techniques are referred to as thermal-aware or thermal-sensitive techniques. Also, while many of the above-mentioned techniques can be extended to multicore or multiprocessor systems, several techniques have been especially designed to address the issues arising in multicore or multiprocessor systems [15,37,80,86,90,96,98,101,102,105–115].

To achieve additional energy savings, cache energy saving techniques have been synergistically integrated with other approaches, such as DVFS (dynamic voltage/frequency scaling) [91,111,116–119], data compression [78,108,120,121] and prefetching [122]. Further, cache leakage energy saving has also been discussed in the context of QoS (quality-of-service) systems [83,96], real-time systems [100,101,123] and embedded multitasking systems [99,101,124,125]. Table 3 provides an overview of leakage energy saving techniques.

4.2. Discussion

For both state-preserving and state-destroying leakage control,architectural techniques make use of some well-known circuit-level mechanisms. Powell et al. [62] propose a circuit design named‘gated Vdd’, which facilitates state-destroying leakage control. This





technique adds an extra transistor in the supply voltage path or ground path of the SRAM (static random access memory) cell. For reducing the leakage energy of the SRAM cell, this transistor is turned off, and by the stacking effect of the transistor, the leakage current is reduced by orders of magnitude. Similarly, Flautner et al. [58] discuss a circuit design named 'drowsy cache', which facilitates state-preserving leakage control. This technique uses two voltage supplies to the cache, one of which is low voltage and the other is high voltage. For reducing the leakage energy of the SRAM cell, the cache controller switches the operating voltage of the cell to the low voltage, thus putting the cell in low-leakage mode. When this line is accessed the next time, the supply voltage is again switched to high, and the cache block consumes normal power. Kim et al. [59] propose a "super-drowsy" circuit design and Agarwal et al. [61] propose a gated-ground circuit design, both of which behave similarly to the drowsy cache, except that they require only a single voltage supply. Similarly, another state-preserving circuit design, named multithreshold CMOS (MTCMOS), dynamically changes the threshold voltage of the SRAM cell by modulating the backgate bias voltage to transition the cell to low-leakage mode [126].

Several energy saving techniques are based on the generational nature of cache access, which implies that cache lines have a period of frequent use when they are first brought into the cache, and then have a period of "dead time" before they are evicted. So, if a cache line has not been accessed for a certain number of cycles (called the 'decay interval' or 'update window'), it indicates that the line has become dead and it can be put in low-leakage mode for saving energy. Using this principle, Flautner et al. [58] propose the 'drowsy cache' technique, which puts dead cache lines into a low-power state-preserving mode. Similarly, Kaxiras et al. [71] propose the 'decay cache' technique, which puts dead cache lines into a low-power state-destroying mode.
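The decay principle described above can be captured in a few lines. The sketch below is a behavioral model only (the counter granularity, class name and tick abstraction are illustrative assumptions, not the actual hardware of [71], which uses coarse global counters and small per-line counters):

```python
# Hypothetical behavioral model of a decay-cache line [71]: each line keeps
# a coarse idle counter; an access resets it, and a line whose counter
# reaches the decay interval is powered off (state-destroying).
DECAY_INTERVAL_TICKS = 4  # illustrative; each tick stands for many cycles

class DecayLine:
    def __init__(self):
        self.powered_on = False
        self.idle_ticks = 0

    def access(self):
        # A hit (or fill) wakes the line and restarts its generation.
        self.powered_on = True
        self.idle_ticks = 0

    def tick(self):
        # Invoked periodically; turning the line off is state-destroying,
        # so a later access to it would be a miss.
        if self.powered_on:
            self.idle_ticks += 1
            if self.idle_ticks >= DECAY_INTERVAL_TICKS:
                self.powered_on = False

line = DecayLine()
line.access()
for _ in range(DECAY_INTERVAL_TICKS):
    line.tick()
assert not line.powered_on  # line decayed after staying idle
```

A drowsy cache [58] would differ only in the action taken: the line would be switched to a low-voltage, state-preserving mode instead of being powered off.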

Several researchers have proposed improvements to the original decay cache technique. Since the optimal value of the decay interval varies with the application, Zhou et al. [72] propose a technique for dynamically adapting the decay interval for each application. Their technique turns off only the data and keeps the tags alive. Using the tags, their technique estimates the hypothetical miss rate that would result if all the data lines were active. Then, the aggressiveness of cache line turn-off is controlled to make the actual miss rate closely track the hypothetical miss rate. Abella et al. [73] keep track of the inter-access time and the number of accesses for each cache line and use this to compute a suitable decay time for each individual cache line.

Kadayif et al. [122] study the interaction of prefetching and cache line turn-off. The prefetching mechanism is used to improve the performance of the processor, while the leakage control mechanism is used to save energy in the caches of the processor. Thus, their work studies how these two techniques interact and proposes methods to enable their synergistic operation. Since normal cache lines and those brought in by prefetching have different usage patterns, their technique works by using different decay intervals for the two kinds of cache lines.

Petit et al. [64] use recency information of the set-associative structure of caches to keep either one or two MRU way(s) alive and switch the rest of the ways to drowsy mode. Since most accesses are likely to hit in the MRU way(s), this technique saves energy while also improving the number of hits to alive (i.e., non-drowsy) cache lines. Bardine et al. [81] use the ratio of hits to the MRU way and that to the least recently used active way in all the sets to estimate the degree of locality present in the memory access stream. A high value of the ratio indicates that most accesses hit near the MRU way and hence, if more than two ways are enabled, a single cache way can be disabled using a state-destroying leakage control mechanism. Conversely, a low value of the ratio indicates that cache hits are distributed over different ways and hence, a single cache way is enabled.
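The feedback loop of Bardine et al. [81] can be sketched as a small controller. The thresholds, interval and function name below are made-up tuning knobs for illustration, not values from the paper:

```python
# Illustrative way-count controller in the spirit of Bardine et al. [81]:
# compare hits to the MRU way against hits to the least-recently-used
# *active* way, then grow or shrink the number of enabled ways.
HIGH_LOCALITY = 4.0   # ratio above this -> accesses cluster near MRU: shrink
LOW_LOCALITY = 1.5    # ratio below this -> hits are spread out: grow
MAX_WAYS = 8          # physical associativity (illustrative)

def adjust_active_ways(active_ways, mru_hits, lru_active_hits):
    ratio = mru_hits / max(lru_active_hits, 1)  # avoid division by zero
    if ratio > HIGH_LOCALITY and active_ways > 2:
        return active_ways - 1   # disable one way (state-destroying)
    if ratio < LOW_LOCALITY and active_ways < MAX_WAYS:
        return active_ways + 1   # enable one more way
    return active_ways

assert adjust_active_ways(4, 900, 100) == 3   # strong MRU locality: shrink
assert adjust_active_ways(4, 100, 90) == 5    # spread-out hits: grow
assert adjust_active_ways(2, 900, 100) == 2   # never below two active ways
```

The key design point is that the decision uses only two hit counters per interval, so the monitoring hardware stays trivial.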


Zhao et al. [67] adapt the interval of transitioning the cache-blocks to drowsy mode by taking into account the reuse distance ofthe caches. The reuse distance of a memory access is defined as thenumber of distinct cache lines referenced since the last reference tothe requested line. The reuse distance reflects the temporal localityof the access pattern. A small reuse distance indicates that thereexists a strong likelihood of future reference and vice versa. Thus,instead of using a fixed interval based on the number of cycles, theirtechnique transitions a cache block to drowsy mode after a fixed Ndistinct references to the block. Thus, their technique also improvesthe number of hits to the alive cache lines.
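The reuse-distance trigger of Zhao et al. [67] can be sketched as follows. The value of N, the trace and the function name are illustrative assumptions:

```python
# Sketch of a reuse-distance-triggered drowsy policy (cf. Zhao et al. [67]):
# a block goes drowsy once N *distinct* other lines have been referenced
# since its last access, rather than after a fixed cycle count.
N_DISTINCT = 3  # illustrative threshold

def drowsy_after_trace(trace, target):
    """Return True if `target` would be drowsy at the end of `trace`."""
    distinct_since_access = set()
    for line in trace:
        if line == target:
            distinct_since_access.clear()  # an access refreshes the block
        else:
            distinct_since_access.add(line)
    return len(distinct_since_access) >= N_DISTINCT

# A is re-referenced before 3 distinct lines intervene: it stays awake.
assert not drowsy_after_trace(['A', 'B', 'C', 'A', 'B'], 'A')
# 3 distinct lines (B, C, D) follow A's last access: A goes drowsy.
assert drowsy_after_trace(['A', 'B', 'C', 'D'], 'A')
```

A real implementation would of course track this with small per-block counters rather than sets, but the trigger condition is the same.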

Mohyuddin et al. [65] propose a technique for saving leakageenergy by maintaining different ways of a cache at different state-preserving power saving modes depending on their replacementpriorities. Going from the MRU way to the LRU way, cache lines arekept in increasingly aggressive power saving mode which also haveincreasingly larger penalties of cache line wakeup.

Chung and Skadron [68] adapt the drowsy cache technique forinstruction caches using branch predictor information. On an accessto the cache-block, the drowsy cache technique incurs the wakeuppenalty, which lies at the critical access path. To hide this latency,Chung et al. propose using the branch-predictor to identify thenext cache-block which would be accessed. Based on this, only thedesired cache-block can be woken up before the actual access. In thecase of branch mispredictions, the prediction of cache-block alsobecomes wrong, however, in such cases, the extra wakeup time ishidden due to the time taken in misprediction recovery. Zhang et al.[63] propose a compiler technique for saving cache leakage energy.This technique uses the compiler to perform program data reuseanalysis to determine the cache access pattern. Using this informa-tion, all the cache lines are placed into state-preserving low-leakagemode and a cache line is brought to normal power-mode just beforeit is accessed.

Since different programs and even different phases of an application have different cache requirements, several techniques save cache energy by dynamically reconfiguring the cache for each program or program phase and turning off the rest of the cache. Using this idea, Albonesi [89] proposes the selective-ways approach, where some of the ways of the cache are turned off to save energy. Yang et al. [74] discuss the selective-sets approach, where leakage energy is saved by turning off some of the sets of the cache. Yang et al. [95] also discuss selective sets and ways, where both the number of sets and ways can be altered to save leakage energy in data and instruction caches.
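The two resizing mechanisms just described affect lookup differently, as the following minimal sketch shows (cache geometry and helper names are illustrative assumptions):

```python
# Minimal sketch of selective-ways [89] vs. selective-sets [74] lookup.
# Disabling ways narrows the per-set search; disabling sets masks off
# high-order index bits, so fewer sets are decoded.
TOTAL_SETS = 64          # power of two (illustrative)
OFFSET_BITS = 6          # 64 B lines

def set_index(addr, active_sets):
    # With active_sets < TOTAL_SETS, fewer index bits are used, which is
    # why set-decoding changes on every selective-sets reconfiguration.
    return (addr >> OFFSET_BITS) & (active_sets - 1)

def lookup(cache, addr, active_sets, active_ways):
    idx = set_index(addr, active_sets)
    tag = addr >> OFFSET_BITS
    # Selective-ways: only the first `active_ways` ways are searched.
    return any(way == tag for way in cache[idx][:active_ways])

cache = [[None] * 8 for _ in range(TOTAL_SETS)]
addr = 0x12340
cache[set_index(addr, 64)][0] = addr >> OFFSET_BITS  # fill one line
assert lookup(cache, addr, 64, 8)                    # hit at full size
```

Note that halving `active_sets` silently remaps addresses to different sets, which is why set-level reconfiguration usually requires flushing or migrating the affected lines.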

To dynamically reconfigure caches using the selective-ways approach, the program response for different numbers of cache ways needs to be estimated. For this purpose, researchers generally utilize utility monitors based on the Mattson stack algorithm (e.g., [127,128]). Similarly, for utilizing the selective-sets approach, researchers generally use a set-sampling method and multiple auxiliary tags for getting profiling information (e.g., [75]). Mittal et al. [75] present a hybrid set and way reconfiguration approach for leakage energy saving in last-level caches. Their technique uses dynamic profiling to predict the cache usage and energy efficiency of the application under multiple cache configurations. Using these estimates, at the end of a fixed interval, the cache is reconfigured to the best configuration.
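A Mattson-stack utility monitor exploits the LRU inclusion property: an access that hits at stack distance d under LRU would hit in any cache with at least d ways. A minimal single-set sketch (function and variable names are illustrative):

```python
# Sketch of a Mattson-stack utility monitor (cf. [127,128]): one pass over
# an access trace yields, for every way count w, the hits an LRU set of
# associativity w would see, letting the controller pick the way count.
def way_utility(trace, max_ways):
    stack = []                      # LRU stack, MRU at the front
    hits = [0] * (max_ways + 1)     # hits[w] = hits with w active ways
    for line in trace:
        if line in stack:
            depth = stack.index(line) + 1   # LRU stack distance
            for w in range(depth, max_ways + 1):
                hits[w] += 1                # a hit for any w >= depth
            stack.remove(line)
        stack.insert(0, line)       # move (or insert) line to MRU position
        del stack[max_ways:]        # bound the stack to max associativity
    return hits

hits = way_utility(['A', 'B', 'A', 'C', 'B'], max_ways=4)
assert hits[1] == 0   # no reuse at stack distance 1
assert hits[2] == 1   # the A->A reuse has stack distance 2
assert hits[3] == 2   # the B->B reuse additionally hits with 3 ways
```

Real monitors sample only a few sets and use hit counters per recency position, but the marginal-utility information they produce is exactly `hits[w] - hits[w-1]`.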

Li et al. [16] discuss different techniques for saving cache leakage energy by exploiting the data duplication across different levels of the cache hierarchy. Their technique works by putting an L2 block in low-leakage (either state-preserving or state-destroying) mode when the block also exists in the L1 cache. Their technique essentially tries to make the cache hierarchy non-inclusive for live cache lines.

Kotera et al. [90] use the selective-ways technique in the context of chip multiprocessors to achieve both cache partitioning and energy





saving. Their technique works by allocating just a suitable number of cache ways to different programs and turning off the rest of the ways for saving cache energy.

Monchiero et al. [109] propose techniques to save leakage energy in the private snoopy L2 caches of chip multiprocessors (CMPs) by selectively switching off infrequently used lines. One of their techniques turns off cache blocks which have been invalidated by the coherence protocol itself. The advantage of this technique is that it does not induce extra misses and hence does not incur a performance penalty. They also propose other techniques which work by carefully choosing the coherence-state transitions on which a block is decayed, so that leakage energy is saved with minimal performance loss.

Reddy and Petrov [99] present an energy saving approach for embedded systems. They use an offline algorithm to select the best cache partitioning for different running applications and use this information at runtime. They also show the usefulness of their technique in reducing inter-task interference in a preemptive multitasking environment. Similarly, Paul and Petrov [125] propose an approach for partitioning the instruction cache for saving energy in embedded multitasking systems.

Since most energy saving techniques aim to aggressively save cache energy, their use may lead to large performance degradation, which may be unacceptable in QoS systems. Mittal et al. [83] present a technique for saving cache energy in QoS systems. Their technique allocates cache at the granularity of cache colors. Using an auxiliary tag structure for different cache configurations (having the same number of ways and different set counts), their technique predicts the cache energy and program performance for multiple cache configurations. Using this, in each interval, the cache is dynamically reconfigured to a suitable cache size such that the QoS target of the program can be met, while saving the maximum possible amount of energy.

Ku et al. [93] propose a thermal-aware leakage energy saving technique, which works on the intuition that apart from saving leakage energy in turned-off cache lines, the leakage energy of active parts can also be saved by intelligently turning off cache lines such that the chip temperature is reduced. Based on this, for the same amount of turned-off cache, instead of turning off entire banks, their technique turns off alternating rows of memory cells in the bank. For the same number of turned-off lines, compared to thermal-unaware schemes, their scheme increases the distance between active blocks, and thus reduces the chip temperature, which also lowers the leakage energy dissipation.

Compared to CPUs, GPUs (Graphics Processing Units) typically use caches of smaller size, and hence, most of the work on cache energy saving has targeted CPU architectures. However, cache energy saving in GPUs has recently attracted the attention of researchers. Wang et al. [129] discuss a microarchitectural technique for saving energy in both L1 and L2 caches in GPUs. They propose putting the L1 cache in a state-preserving, low-leakage mode when there are no threads ready to be scheduled. Further, the L2 cache is put in low-leakage mode when there is no memory request.

5. Approaches for saving both dynamic and leakage energy

Several studies present reconfigurable cache architectures which offer the flexibility to change one or more parameters of the cache. By taking advantage of the flexibility offered by these architectures, both dynamic and leakage energy can be saved.

Zhang et al. [92] propose a highly configurable cache architecture which contains four separate banks that can operate as four separate ways. By concatenating these ways, the associativity of the cache can be altered and/or some ways can be shut down. Thus, the


associativity of the cache can be changed to 1, 2 or 4. Similarly, by configuring the fetch unit to fetch different sizes of cache lines, the cache line size can also be altered. Wang and Mishra [100] and Rawlins and Gordon-Ross [108] use this architecture for saving cache energy. For example, Wang and Mishra profile several possible configurations of the L1 data cache, L1 instruction cache and unified L2 cache in an offline manner and, at runtime, explore different possible combinations of the two-level cache hierarchy to find the most energy-efficient configuration. Similarly, Rawlins and Gordon-Ross discuss their technique for saving cache energy in heterogeneous dual-core systems by tuning the L1 cache size, while addressing the issues presented by multicore operation, such as core interactions and data coherence.

Abella and González [130] propose a 'Heterogeneous Way-size cache' where the number of sets in each cache way can be different. The only requirement is that the number of sets in each way should be a power of two. Note that conventional caches use the same number of sets in each cache way. By adapting the size of each way according to the application requirement, their technique saves both dynamic and leakage energy.

Benitez et al. [77] propose the 'Amorphous cache', which uses heterogeneous sub-caches that can be selectively powered off. Thus, the Amorphous cache allows changing the total cache size and/or set-associativity, depending upon the program requirement, for saving both leakage and dynamic energy.

Wong et al. [131] propose using different voltages for different levels of the cache. Unlike the drowsy cache technique [58], their technique does not dynamically change the voltage of the cache block. Rather, throughout the execution, the cache is operated at a fixed voltage, which is lower than the core voltage. Moreover, the level-two cache is operated at a lower voltage than the level-one cache, which, in turn, can operate at a lower voltage than the core. At the interfaces between these components, voltage level converters are used. Since the cache is operated at low voltage, both the leakage and dynamic energy of access are saved.

Jiang et al. [107] propose a technique for saving energy in chip multiprocessors using asymmetric last-level caches. Their approach works by allocating a suitable amount of cache to each application; however, their approach differs from conventional approaches based on cache reconfiguration or partitioning (such as [74,77]) in that asymmetric caches are physically separated private caches of different sizes, and using them to achieve energy efficiency requires smart scheduling techniques. Thus, their technique uses the OS scheduler to assign applications with large working sets to large caches and those with smaller working sets to smaller caches. Smaller caches reduce access energy and operating voltage, and larger caches use cache line turn-off to save leakage energy.

Alves et al. [82] propose a technique for saving both cache leakage and dynamic energy. Their technique predicts the usage pattern of the sub-blocks of a cache block, namely which sub-blocks of a cache line will actually be used and how many times each will be used. This information is used to bring into the cache only those sub-blocks which are necessary and to turn them off after they have been touched the estimated number of times. Further, they augment the cache replacement policy to preferentially evict those cache blocks for which all sub-blocks have become dead. Note that compared to other techniques which utilize cache liveliness information (e.g., decay cache [71] or drowsy cache [58]) and work at the cache-block level, the technique proposed by Alves et al. works at the cache sub-block level.
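A toy version of the sub-block bookkeeping just described is shown below. The predictor outputs (fetch mask and touch count), the class and sub-block count are stand-in assumptions for illustration:

```python
# Toy sub-block usage tracking in the spirit of Alves et al. [82]: a
# predictor supplies, per fill, which sub-blocks to fetch and an expected
# touch count; a sub-block is powered off once its budget is spent.
class Block:
    def __init__(self, fetch_mask, expected_touches, n_sub=4):
        # Only predicted-useful sub-blocks are brought into the cache.
        self.live = [bool(fetch_mask & (1 << i)) for i in range(n_sub)]
        self.budget = [expected_touches if l else 0 for l in self.live]

    def touch(self, sub):
        if not self.live[sub]:
            return False              # sub-block was never fetched: miss
        self.budget[sub] -= 1
        if self.budget[sub] <= 0:
            self.live[sub] = False    # expected uses exhausted: power off
        return True

    def fully_dead(self):
        # The replacement policy can preferentially evict such blocks.
        return not any(self.live)

blk = Block(fetch_mask=0b0011, expected_touches=1)
assert blk.touch(0) and blk.touch(1)
assert not blk.touch(2)               # sub-block 2 was predicted unused
assert blk.fully_dead()
```

Mispredictions (touching an unfetched sub-block, or touching one more often than expected) would trigger a refetch or wakeup in a real design; the sketch omits that recovery path.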

Several researchers have presented techniques for synergistically using both leakage and dynamic energy saving techniques.

For example, Giorgi and Bennati [22] demonstrate that using a filter cache [18] reduces the number of accesses to the L1 cache, which, in turn, enables effectively using leakage energy saving techniques in L1 caches. Similarly, Keramidas et al. [31] propose a



way-selection based technique for additionally saving dynamic energy in caches which use decay-based leakage energy management. Their technique works on the observation that in a cache using the cache-decay mechanism [71] for saving leakage energy, several cache blocks may be dead. Thus, by making an early determination of these dead blocks, the accesses to these cache blocks can be avoided, which leads to saving of dynamic energy in the cache. To this end, their technique uses a decaying Bloom filter to track liveliness information of each cache way of the cache sets. The Bloom filter enables an early prediction of a cache miss, and thus, based on this information, only selected cache ways are accessed, which leads to saving of dynamic energy of cache access. Since the way-selection mechanism, unlike a way-prediction mechanism, gives definite information about a cache miss, it always leads to a uniform cache hit latency.
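The guarantee behind way-selection rests on the Bloom filter's one-sided error: a zero bit proves absence. A toy per-way filter in that spirit (filter size, single-hash indexing and the simplistic clear-on-decay are illustrative assumptions; real decaying filters use counters or periodic decay instead):

```python
# Toy decaying-Bloom-filter way selection (cf. Keramidas et al. [31]): one
# small filter per way records live tags; a zero filter bit proves the tag
# is absent, so that way need not be powered for the access.
FILTER_BITS = 64  # illustrative filter size

class WayFilter:
    def __init__(self):
        self.bits = [0] * FILTER_BITS

    def _pos(self, tag):
        return hash(tag) % FILTER_BITS

    def insert(self, tag):       # called on fill into this way
        self.bits[self._pos(tag)] = 1

    def clear(self, tag):        # called on decay/eviction (the "decaying"
        self.bits[self._pos(tag)] = 0  # part); simplified vs. real designs

    def may_contain(self, tag):  # False is definite; True may be a false
        return self.bits[self._pos(tag)] == 1  # positive

filters = [WayFilter() for _ in range(4)]
filters[2].insert('tag_x')
ways_to_probe = [w for w, f in enumerate(filters) if f.may_contain('tag_x')]
assert ways_to_probe == [2]  # only the way that may hold the line is probed
```

False positives only cost extra way activations, never correctness, which is why hit latency stays uniform.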

6. Cache energy saving in real-world chips

In this section, we discuss a few commercial chips which provide runtime power management features.

Malik et al. [132] discuss the Motorola M Core M340 processor, which provides the flexibility of turning off ways of the L1 cache for

saving energy. Gerosa et al. [133] discuss the design of a low-power Intel processor designed for Mobile Internet Devices (MIDs) and Ultra-Mobile PCs. This chip uses energy saving techniques in both the L1 and L2 caches. The L2 cache is an 8-way, 512 KB cache, and for applications with low cache demand, up to 6 ways can be turned off using power gating and sleep transistors, resulting in a 10× reduction in

leakage power.

Chang et al. [134] discuss the design of a 65 nm, 16 MB, on-die L3 cache for the dual-core Intel Xeon 7100 chip. This cache implements low-power techniques to save both leakage and dynamic energy. To save leakage energy, state-preserving techniques are used in the SRAM array and peripherals, which reduce the cache leakage by more than 2× over an unoptimized cache. To save dynamic energy,

at each cache access, only 0.8% of all array blocks are powered up. Similarly, Intel's 45 nm 8-core Enterprise Xeon processor [135] and 32 nm Westmere processor series [136] also implement several design features for saving leakage and dynamic energy in caches.

Sakran et al. [137] discuss the architecture of the 65 nm dual-core Intel Merom processor chip, which has a 4 MB shared L2 cache. This chip uses sleep transistors (STs) in the memory arrays, decoders and write drivers. Using STs, the leakage is reduced by 3 times while still preserving the data. Further, the chip uses microarchitectural techniques to identify low usage of the cache and then switches the STs of some parts of the cache to shut-off mode, which reduces the array leakage by 7 times.

Gammie et al. [3] discuss the 'SmartReflex' power management technology used by Texas Instruments mobile processors, such as the 90 nm OMAP2420 processor [138], the 65 nm OMAP3430 processor [139] and the 45 nm 3.5G Baseband and Multimedia Application Processor [140]. For saving both dynamic and leakage energy in SRAM, these processors use techniques such as state-preserving and state-destroying leakage control and voltage scaling.

George et al. [141] discuss the architecture of the Intel 45 nm dual-core chip, codenamed Penryn, which has several features for saving both leakage and dynamic energy. Penryn is based on the Core microarchitecture and has a unified, 24-way L2 cache with a size of 6 MB. The cache is organized in 1 MB slices, each containing 4096 lines and 4 ways. Also, each slice consists of 32 data banks, each of which contains 2 sub-arrays. For saving leakage energy, the hardware design allows turning off the cache at the granularity of a single slice (4 ways). For saving dynamic energy, the cache controller activates only half of the sub-array for any L2 access. Branover et al. [142] discuss the architecture of the AMD Fusion APU (accelerated processing unit),


named Llano, which is designed in 32 nm technology. The Llano APU consists of four x86 CPU cores, each of which has a private 1 MB L2 cache. For leakage energy saving, Llano provisions power gating for each core and its associated L2 cache separately.

Zhang et al. [143] discuss the design of a 65 nm SRAM which uses sleep transistors for saving leakage energy. This SRAM can dynamically control the sleep transistors to reduce the leakage energy of the cell by 3–5 times, while preserving its information content. Several other SRAM designs also implement power management features, for example [144–147].

7. Concluding remarks

Driven by continuous innovations in CMOS fabrication technology, recent years have witnessed the widespread use of multicore processors and large-sized on-chip caches for achieving high performance. However, due to this, the total power consumption of processors is rapidly approaching the "power wall" imposed by the thermal limitations of cooling solutions and power delivery. Thus, to be able to continue achieving higher performance through technological scaling, managing the power consumption of processors has become a vital necessity.

In this paper, we have reviewed several architectural techniquesproposed for managing dynamic and leakage power in caches. Wehave also discussed examples of commercial chips, which providemechanisms to save cache power at runtime. We believe thatour survey will enable researchers and engineers to understandthe state-of-the-art in microarchitectural techniques for improvingcache energy efficiency and motivate them to design novel solu-tions for addressing the challenges posed by future trends of CMOSfabrication and processor design.

References

[1] S. Murugesan, Harnessing green IT: principles and practices, IT Professional10 (1) (2008) 24–33.

[2] S. Borkar, Design challenges of technology scaling, Micro IEEE 19 (4) (1999)23–29.

[3] G. Gammie, A. Wang, H. Mair, R. Lagerquist, M. Chau, P. Royannez, S. Gururajarao, U. Ko, SmartReflex power and performance management technologies for 90 nm, 65 nm, and 45 nm mobile application processors, Proceedings of the IEEE 98 (2) (2010) 144–159.

[4] International Technology Roadmap for Semiconductors (ITRS), 2011.http://www.itrs.net/Links/2011ITRS/2011Chapters/2011ExecSum.pdf

[5] S. Borkar, Thousand core chips: a technology perspective, in: 44th AnnualDesign Automation Conference, ACM, 2007, pp. 746–749.

[6] First the Tick, Now the Tock: Next Generation Intel Microarchitecture(Nehalem), Tech. Rep., Intel Whitepaper, 2008.

[7] B. Stackhouse, et al., A 65 nm 2-billion transistor quad-core Itanium processor,IEEE Journal of Solid-State Circuits 44 (1) (2009) 18–31.

[8] R. Riedlinger, R. Bhatia, L. Biro, B. Bowhill, E. Fetzer, P. Gronowski, T. Grutkowski, A 32 nm 3.1 billion transistor 12-wide-issue Itanium® processor for mission-critical servers, in: IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 84–86.

[9] A. Vardhan, Y. Srikant, Exploiting critical data regions to reduce data cache energy consumption, Tech. Rep., Indian Institute of Science, Bangalore, 2013.

[10] S. Li, J. Ahn, R. Strong, J. Brockman, D. Tullsen, N. Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: 42nd IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009, pp. 469–480.

[11] S. Rodriguez, B. Jacob, Energy/power breakdown of pipelined nanometercaches (90 nm/65 nm/45 nm/32 nm), in: International Symposium on LowPower Electronics and Design, ACM, 2006, pp. 25–30.

[12] J. Butts, G. Sohi, A static power model for architects, in: International Sympo-sium on Microarchitecture, 2000, pp. 191–201.

[13] D. Helms, E. Schmidt, W. Nebel, Leakage in CMOS circuits – an introduction, Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation (2004) 17–35.

[14] A. Udipi, N. Muralimanohar, R. Balasubramonian, Non-uniform power accessin large caches with low-swing wires, in: International Conference on HighPerformance Computing (HiPC), IEEE, 2009, pp. 59–68.

[15] S. Mittal, Dynamic Cache Reconfiguration Based Techniques for ImprovingCache Energy Efficiency, Iowa State University, 2013 (Ph.D. Thesis).




[16] L. Li, I. Kadayif, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. Irwin, A. Sivasubramaniam, Leakage energy management in cache hierarchies, in: IEEE PACT, 2002, pp. 131–140.

[17] S. Kaxiras, M. Martonosi, Computer architecture techniques for power-efficiency, Synthesis Lectures on Computer Architecture 3 (1) (2008) 1–207.

[18] J. Kin, M. Gupta, W. Mangione-Smith, The filter cache: an energy efficientmemory structure, in: 30th International Symposium on Microarchitecture(MICRO), 1997, pp. 184–193.

[19] M. Rawlins, A. Gordon-Ross, On the interplay of loop caching, code compression, and cache configuration, in: 16th Asia and South Pacific Design Automation Conference, 2011, pp. 243–248.

[20] Y.-Y. Tsai, C.-H. Chen, Energy-efficient trace reuse cache for embedded processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (9) (2011) 1681–1694.

[21] A. Gordon-Ross, S. Cotterell, F. Vahid, Tiny instruction caches for low powerembedded systems, ACM Transactions in Embedded Computing Systems 2(4) (2003) 449–481.

[22] R. Giorgi, P. Bennati, Reducing leakage in power-saving capable caches for embedded systems by using a filter cache, in: Workshop on MEmory Performance: DEaling with Applications, Systems and Architecture, 2007, pp. 97–104.

[23] M. Powell, A. Agrawal, T. Vijaykumar, B. Falsafi, K. Roy, Reducing set-associative cache energy via way-prediction and selective direct-mapping,in: 34th International Symposium on Microarchitecture, 2001, pp. 54–65.

[24] P. Carazo Minguela, R. Apolloni, F. Castro, D. Chaver, L. Pinuel, F. Tirado, L1 datacache power reduction using a forwarding predictor, Integrated Circuit andSystem Design. Power and Timing Modeling, Optimization, and Simulation(2011) 116–125.

[25] C. Ballapuram, A. Sharif, H. Lee, Exploiting access semantics and programbehavior to reduce snoop power in chip multiprocessors, in: ACM SigplanNotices, vol. 43, 2008, pp. 60–69.

[26] Z. Zhu, X. Zhang, Access-mode predictions for low-power cache design, MicroIEEE 22 (2) (2002) 58–71.

[27] B. Calder, D. Grunwald, J. Emer, Predictive sequential associative cache,in: International Symposium on High-Performance Computer Architecture,1996, pp. 244–253.

[28] D. Nicolaescu, B. Salamat, A. Veidenbaum, M. Valero, Fast speculative address generation and way caching for reducing L1 data cache energy, in: International Conference on Computer Design (ICCD), 2006, pp. 101–107.

[29] M. Ghosh, E. Ozer, S. Ford, S. Biles, H. Lee, Way guard: a segmented counting Bloom filter approach to reducing energy for set-associative caches, in: International Symposium on Low Power Electronics and Design, 2009, pp. 165–170.

[30] G. Memik, G. Reinman, W. Mangione-Smith, Just say no: benefits of early cache miss determination, in: International Symposium on High-Performance Computer Architecture (HPCA), 2003, pp. 307–316.

[31] G. Keramidas, P. Xekalakis, S. Kaxiras, Applying decay to reduce dynamicpower in set-associative caches, High Performance Embedded Architecturesand Compilers (2007) 38–53.

[32] Z. Mingming, C. Xiaotao, Z. Ge, Reducing cache energy consumption by tagencoding in embedded processors, in: International Symposium on LowPower Electronics and Design (ISLPED), 2007, pp. 367–370.

[33] R. Min, W. Jone, Y. Hu, Location cache: a low-power L2 cache system, in: International Symposium on Low Power Electronics and Design (ISLPED), IEEE, 2004, pp. 120–125.

[34] Z. Fang, L. Zhao, X. Jiang, S. Lu, R. Iyer, T. Li, S. Lee, Reducing L1 caches powerby exploiting software semantics, in: International Symposium on Low PowerElectronics and Design (ISLPED), 2012.

[35] T. Jones, S. Bartolini, B. De Bus, J. Cavazos, F. O’Boyle, Instruction cache energysaving through compiler way-placement, in: Design, Automation and Test inEurope (DATE), IEEE, 2008, pp. 1196–1201.

[36] C. Yu, P. Petrov, Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors,in: 5th IEEE/ACM International Conference on Hardware/Software Codesignand System Synthesis, 2007, pp. 245–250.

[37] K.T. Sundararajan, V. Porpodas, T.M. Jones, N.P. Topham, B. Franke, Cooper-ative partitioning: energy-efficient cache partitioning for high-performanceCMPs, in: International Symposium on High-Performance Computer Archi-tecture (HPCA), 2012.

[38] T. Kalyan, M. Mutyam, Word-interleaved cache: an energy efficient data cachearchitecture, in: ACM/IEEE International Symposium on Low Power Electron-ics and Design (ISLPED), 2008, pp. 265–270.

[39] K. Inoue, T. Ishihara, K. Murakami, Way-predicting set-associative cache forhigh performance and low energy consumption, in: International Symposiumon Low Power Electronics and Design, 1999, pp. 273–275.

[40] J. Yang, R. Gupta, Energy efficient frequent value data cache design, in: Inter-national Symposium on Microarchitecture (MICRO), 2002, pp. 197–207.

[41] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D.H. Albonesi, S.Dwarkadas, G. Semeraro, G. Magklis, M.L. Scott, Integrating adaptive on-chipstorage structures for reduced dynamic power, in: PACT, 2002.

[42] Z. Hongwei, Z. Chengyi, Z. Mingxuan, Improved way prediction policy for

low-energy instruction caches, Embedded Software and Systems (2007)425–436.

[43] H. Park, S. Yoo, S. Lee, A multistep tag comparison method for a low-powerL2 cache, IEEE Transactions on Computer-Aided Design of Integrated Circuitsand Systems 31 (4) (2012) 559–572.


[44] J. Kwak, Y. Jeon, Compressed tag architecture for low-power embedded cache systems, Journal of Systems Architecture 56 (9) (2010) 419–428.

[45] M. Loghi, P. Azzoni, M. Poncino, Tag overflow buffering: reducing total memory energy by reduced-tag matching, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (5) (2009) 728–732.

[46] A. Shafiee, N. Shahidi, A. Baniasadi, Using partial tag comparison in low-power snoop-based chip multiprocessors, in: A. Varbanescu, A. Molnos, R. van Nieuwpoort (Eds.), Computer Architecture, vol. 6161 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2012, pp. 211–221.

[47] J. Gu, H. Guo, P. Li, Robtic: an on-chip instruction cache design for low power embedded systems, in: 15th IEEE RTCSA, 2009, pp. 419–424.

[48] F.M. Sleiman, R.G. Dreslinski, T.F. Wenisch, Embedded way prediction for last-level caches, in: IEEE 30th International Conference on Computer Design (ICCD), 2012, pp. 167–174.

[49] Y.K. Cho, S.T. Jhang, C.S. Jhon, Selective word reading for high performance and low power processor, in: ACM Symposium on Research in Applied Computation, 2011, pp. 25–30.

[50] J. Nemeth, R. Min, W. Jone, Y. Hu, Location cache design and performance analysis for chip multiprocessors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (1) (2011) 104–117.

[51] L. Fan, P. Cao, J. Almeida, A. Broder, Summary cache: a scalable wide-area web cache sharing protocol, IEEE/ACM Transactions on Networking (TON) 8 (3) (2000) 281–293.

[52] Y. Guo, P. Narayanan, M.A. Bennaser, S. Chheda, C.A. Moritz, Energy-efficient hardware data prefetching, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (2) (2011) 250–263.

[53] G.H. Loh, 3D-stacked memory architectures for multi-core processors, ACM SIGARCH Computer Architecture News 36 (2008) 453–464.

[54] http://www.tezzaron.com/technology/FaStack.htm (2013).

[55] B. Black, et al., Die stacking (3D) microarchitecture, in: IEEE/ACM International Symposium on Microarchitecture (MICRO), 2006, pp. 469–479.

[56] D. Jevdjic, S. Volos, B. Falsafi, Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? Have it all with footprint cache, in: International Symposium on Computer Architecture (ISCA), 2013.

[57] H. Sun, J. Liu, R.S. Anigundi, N. Zheng, J.-Q. Lu, K. Rose, T. Zhang, 3D DRAM design and application to 3D multicore systems, IEEE Design and Test of Computers 26 (5) (2009) 36–47.

[58] K. Flautner, N. Kim, S. Martin, D. Blaauw, T. Mudge, Drowsy caches: simple techniques for reducing leakage power, in: International Symposium on Computer Architecture (ISCA), 2002, pp. 148–157.

[59] N. Kim, K. Flautner, D. Blaauw, T. Mudge, Single-VDD and single-VT super-drowsy techniques for low-leakage high-performance instruction caches, in: International Symposium on Low Power Electronics and Design (ISLPED), 2004, pp. 54–57.

[60] H. Hanson, M. Hrishikesh, V. Agarwal, S. Keckler, D. Burger, Static energy reduction techniques for microprocessor caches, IEEE Transactions on VLSI Systems 1 (3) (2003) 303–313.

[61] A. Agarwal, H. Li, K. Roy, DRG-cache: a data retention gated-ground cache for low power, in: Design Automation Conference, 2002, pp. 473–478.

[62] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, T. Vijaykumar, Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories, in: International Symposium on Low Power Electronics and Design (ISLPED), 2000, pp. 90–95.

[63] W. Zhang, M. Karakoy, M. Kandemir, G. Chen, A compiler approach for reducing data cache energy, in: 17th Annual International Conference on Supercomputing, ACM, 2003, pp. 76–85.

[64] S. Petit, J. Sahuquillo, J. Such, D. Kaeli, Exploiting temporal locality in drowsy cache policies, in: 2nd Conference on Computing Frontiers, ACM, 2005, pp. 371–377.

[65] N. Mohyuddin, R. Bhatti, M. Dubois, Controlling leakage power with the replacement policy in slumberous caches, in: 2nd Conference on Computing Frontiers, ACM, 2005, pp. 161–170.

[66] J. Hu, A. Nadgir, N. Vijaykrishnan, M. Irwin, M. Kandemir, Exploiting program hotspots and code sequentiality for instruction cache leakage management, in: International Symposium on Low Power Electronics and Design, ACM, 2003, pp. 402–407.

[67] Y. Zhao, X. Li, D. Tong, X. Cheng, Reuse distance based cache leakage control, in: 14th International Conference on High Performance Computing, Springer-Verlag, 2007, pp. 356–367.

[68] S. Chung, K. Skadron, On-demand solution to minimize I-cache leakage energy with maintaining performance, IEEE Transactions on Computers 57 (1) (2008) 7–24.

[69] P. Kalla, X.S. Hu, J. Henkel, Distance-based recent use (DRU): an enhancement to instruction cache replacement policies for transition energy reduction, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14 (1) (2006) 69–80.

[70] A. Bardine, M. Comparetti, P. Foglia, C.A. Prete, Evaluation of leakage reduction alternatives for deep submicron dynamic nonuniform cache architecture caches, in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2013.

[71] S. Kaxiras, Z. Hu, M. Martonosi, Cache decay: exploiting generational behavior to reduce cache leakage power, in: 28th International Symposium on Computer Architecture (ISCA), 2001, pp. 240–251.

[72] H. Zhou, M. Toburen, E. Rotenberg, T. Conte, Adaptive mode control: a static-power-efficient cache design, ACM Transactions on Embedded Computing Systems 2 (3) (2003) 347–372.


[73] J. Abella, A. González, X. Vera, M. O'Boyle, IATAC: a smart predictor to turn-off L2 cache lines, ACM Transactions on Architecture and Code Optimization 2 (1) (2005) 55–77.

[74] S.-H. Yang, B. Falsafi, M.D. Powell, K. Roy, T.N. Vijaykumar, An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches, in: 7th International Symposium on High-Performance Computer Architecture (HPCA), 2001.

[75] S. Mittal, Z. Zhang, EnCache: improving cache energy efficiency using a software-controlled profiling cache, in: IEEE International Conference on Electro/Information Technology, Indianapolis, USA, 2012.

[76] K. Sundararajan, T. Jones, N. Topham, Smart cache: a self adaptive cache architecture for energy efficiency, in: International Conference on Embedded Computer Systems (SAMOS), IEEE, 2011, pp. 41–50.

[77] D. Benitez, J. Moure, D. Rexachs, E. Luque, A reconfigurable cache memory with heterogeneous banks, in: Design, Automation and Test in Europe Conference and Exhibition (DATE), IEEE, 2010, pp. 825–830.

[78] K. Tanaka, T. Kawahara, Leakage energy reduction in cache memory by data compression, ACM SIGARCH Computer Architecture News 35 (5) (2007) 17–24.

[79] J. Li, Y. Hwang, Snug set-associative caches: reducing leakage power while improving performance, in: International Symposium on Low Power Electronics and Design, 2005, pp. 345–350.

[80] H. Kim, J. Kim, A leakage-aware L2 cache management technique for producer-consumer sharing in low-power chip multiprocessors, Journal of Parallel and Distributed Computing (2011).

[81] A. Bardine, M. Comparetti, P. Foglia, G. Gabrielli, C. Prete, P. Stenström, Leveraging data promotion for low power D-NUCA caches, in: 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools (DSD), IEEE, 2008, pp. 307–316.

[82] M.A.Z. Alves, et al., Energy savings via dead sub-block prediction, in: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012.

[83] S. Mittal, Z. Zhang, Y. Cao, CASHIER: a cache energy saving technique for QoS systems, in: 26th International Conference on VLSI Design, IEEE, 2013, pp. 43–48.

[84] S. Mittal, Z. Zhang, Palette: a cache leakage energy saving technique for green computing, in: C. Catlett, W. Gentzsch, L. Grandinetti, G. Joubert, J. Vazquez-Poletti (Eds.), HPC: Transition Towards Exascale Processing, Advances in Parallel Computing, IOS Press, 2013.

[85] W. Zhang, J. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, M. Irwin, Compiler-directed instruction cache leakage optimization, in: International Symposium on Microarchitecture (MICRO), 2002, pp. 208–218.

[86] M. Ghosh, H. Lee, Virtual exclusion: an architectural approach to reducing leakage energy in caches for multiprocessor systems, in: International Conference on Parallel and Distributed Systems, vol. 2, IEEE, 2007, pp. 1–8.

[87] S. Kaxiras, P. Xekalakis, G. Keramidas, A simple mechanism to adapt leakage-control policies to temperature, in: International Symposium on Low Power Electronics and Design (ISLPED), 2005, pp. 54–59.

[88] Y. Li, D. Parikh, Y. Zhang, K. Sankaranarayanan, M. Stan, K. Skadron, State-preserving vs. non-state-preserving leakage control in caches, in: Design, Automation and Test in Europe Conference and Exhibition, vol. 1, IEEE, 2004, pp. 22–27.

[89] D.H. Albonesi, Selective cache ways: on-demand cache resource allocation, in: International Symposium on Microarchitecture, 1999, pp. 248–259.

[90] I. Kotera, K. Abe, R. Egawa, H. Takizawa, H. Kobayashi, Power-aware dynamic cache partitioning for CMPs, Transactions on High-Performance Embedded Architectures and Compilers III (2011) 135–153.

[91] K. Meng, R. Joseph, R. Dick, L. Shang, Multi-optimization power management for chip multiprocessors, in: PACT, 2008, pp. 177–186.

[92] C. Zhang, F. Vahid, W. Najjar, A highly configurable cache architecture for embedded systems, in: International Symposium on Computer Architecture (ISCA), 2003, pp. 136–146.

[93] J. Ku, S. Ozdemir, G. Memik, Y. Ismail, Thermal management of on-chip caches through power density minimization, in: International Symposium on Microarchitecture (MICRO), 2005, pp. 283–293.

[94] H. Noori, M. Goudarzi, K. Inoue, K. Murakami, Improving energy efficiency of configurable caches via temperature-aware configuration selection, in: IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2008, pp. 363–368.

[95] S. Yang, B. Falsafi, M. Powell, T. Vijaykumar, Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay, in: International Symposium on High-Performance Computer Architecture (HPCA), 2002, pp. 151–161.

[96] S. Mittal, Z. Zhang, MANAGER: a multicore shared cache energy saving technique for QoS systems, Tech. Rep., Iowa State University, 2013.

[97] G. Keramidas, C. Datsios, S. Kaxiras, A framework for efficient cache resizing, in: International Conference on Embedded Computer Systems (SAMOS), IEEE, 2012, pp. 76–85.

[98] H. Kim, J. Ahn, J. Kim, Replication-aware leakage management in chip multiprocessors with private L2 cache, in: International Symposium on Low Power Electronics and Design, 2010, pp. 135–140.

[99] R. Reddy, P. Petrov, Cache partitioning for energy-efficient and interference-free embedded multitasking, ACM Transactions on Embedded Computing Systems (TECS) 9 (3) (2010) 16.

[100] W. Wang, P. Mishra, Dynamic reconfiguration of two-level caches in soft real-time embedded systems, in: IEEE Computer Society Annual Symposium on VLSI, 2009, pp. 145–150.

[101] W. Wang, P. Mishra, S. Ranka, Dynamic cache reconfiguration and partitioning for energy optimization in real-time multicore systems, in: 48th Design Automation Conference, 2011, pp. 948–953.

[102] K.T. Sundararajan, T.M. Jones, N.P. Topham, The smart cache: an energy-efficient cache architecture through dynamic adaptation, in: International Journal of Parallel Programming, 2012.

[103] L. Yuan, S. Leventhal, J. Gu, G. Qu, TALk: a temperature-aware leakage minimization technique for real-time systems, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (10) (2011) 1564–1568.

[104] L. He, W. Liao, M. Stan, System level leakage reduction considering the interdependence of temperature and leakage, in: Design Automation Conference, 2004, pp. 12–17.

[105] M. Sato, R. Egawa, H. Takizawa, H. Kobayashi, A voting-based working set assessment scheme for dynamic cache resizing mechanisms, in: IEEE International Conference on Computer Design (ICCD), 2010, pp. 98–105.

[106] K. Kedzierski, F. Cazorla, R. Gioiosa, A. Buyuktosunoglu, M. Valero, Power and performance aware reconfigurable cache for CMPs, in: Second International Forum on Next-Generation Multicore/Manycore Technologies, ACM, 2010.

[107] X. Jiang, A. Mishra, L. Zhao, R. Iyer, Z. Fang, S. Srinivasan, S. Makineni, P. Brett, C. Das, ACCESS: smart scheduling for asymmetric cache CMPs, in: International Symposium on High Performance Computer Architecture (HPCA), 2011, pp. 527–538.

[108] M. Rawlins, A. Gordon-Ross, CPACT – the conditional parameter adjustment cache tuner for dual-core architectures, in: International Conference on Computer Design (ICCD), 2011, pp. 396–403.

[109] M. Monchiero, R. Canal, A. Gonzalez, Using coherence information and decay techniques to optimize L2 cache leakage in CMPs, in: International Conference on Parallel Processing (ICPP), 2009, pp. 1–8.

[110] J. Zhao, C. Xu, Y. Xie, Bandwidth-aware reconfigurable cache design with hybrid memory technologies, in: International Conference on Computer-Aided Design, IEEE Press, 2010, pp. 48–55.

[111] H. Ghasemi, S. Draper, N.S. Kim, Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors, in: International Symposium on High Performance Computer Architecture (HPCA), 2011, pp. 38–49.

[112] K. Kedzierski, M. Moreto, F. Cazorla, M. Valero, Adapting cache partitioning algorithms to pseudo-LRU replacement policies, in: International Symposium on Parallel and Distributed Processing (IPDPS), 2010, pp. 1–12.

[113] X. Fu, K. Kabir, X. Wang, Cache-aware utilization control for energy efficiency in multi-core real-time systems, in: 23rd Euromicro Conference on Real-Time Systems (ECRTS), IEEE, 2011, pp. 102–111.

[114] H. Hajimiri, P. Mishra, S. Bhunia, Dynamic cache tuning for efficient memory based computing in multicore architectures, in: IEEE International Conference on VLSI Design, 2013.

[115] M. Lodde, et al., Dynamic last-level cache allocation to reduce area and power overhead in directory coherence protocols, in: C. Kaklamanis, T. Papatheodorou, P. Spirakis (Eds.), Euro-Par 2012 Parallel Processing, vol. 7484 of Lecture Notes in Computer Science, Springer, 2012, pp. 206–218.

[116] A. Nacul, T. Givargis, Dynamic voltage and cache reconfiguration for low power, in: Design, Automation and Test in Europe, vol. 2, IEEE Computer Society, 2004.

[117] R. Jejurikar, C. Pereira, R. Gupta, Leakage aware dynamic voltage scaling for real-time embedded systems, in: Design Automation Conference, IEEE, 2004, pp. 275–280.

[118] W. Wang, P. Mishra, Leakage-aware energy minimization using dynamic voltage scaling and cache reconfiguration in real-time systems, in: 23rd International Conference on VLSI Design, 2010, pp. 357–362.

[119] Z. Ge, T. Mitra, W. Wong, A DVS-based pipelined reconfigurable instruction memory, in: Design Automation Conference, 2009, pp. 897–902.

[120] H. Hajimiri, K. Rahmani, P. Mishra, Synergistic integration of dynamic cache reconfiguration and code compression in embedded systems, in: International Green Computing Conference and Workshops (IGCC), IEEE, 2011, pp. 1–8.

[121] S. Kim, J. Lee, J. Kim, S. Hong, Residue cache: a low-energy low-area L2 cache architecture via compression and partial hits, in: International Symposium on Microarchitecture, 2011, pp. 420–429.

[122] I. Kadayif, A. Zorlubas, S. Koyuncu, O. Kabal, D. Akcicek, Y. Sahin, M. Kandemir, Capturing and optimizing the interactions between prefetching and cache line turnoff, Microprocessors and Microsystems 32 (7) (2008) 394–404.

[123] Y.-J. Chen, C.-L. Yang, J.-W. Chi, J.-J. Chen, TACLC: timing-aware cache leakage control for hard real-time systems, IEEE Transactions on Computers 60 (6) (2011) 767–782.

[124] W. Wang, S. Ranka, P. Mishra, A general algorithm for energy-aware dynamic reconfiguration in multitasking systems, in: 24th International Conference on VLSI Design, 2011, pp. 334–339.

[125] M. Paul, P. Petrov, Dynamically adaptive I-cache partitioning for energy-efficient embedded multitasking, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19 (11) (2011) 2067–2080.

[126] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami, T. Arakawa, H. Hamano, A low power SRAM using auto-backgate-controlled MT-CMOS, in: International Symposium on Low Power Electronics and Design, IEEE, 1998, pp. 293–298.

[127] R.L. Mattson, Evaluation techniques in storage hierarchies, IBM Journal of Research and Development (2010) 9.

[128] M.K. Qureshi, Y.N. Patt, Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches, in: International Symposium on Microarchitecture, 2006, pp. 423–432.

[129] Y. Wang, S. Roy, N. Ranganathan, Run-time power-gating in caches of GPUs for leakage energy savings, in: Design, Automation and Test in Europe Conference and Exhibition (DATE), 2012, pp. 300–303.

[130] J. Abella, A. González, Heterogeneous way-size cache, in: International Conference on Supercomputing, ACM, 2006, pp. 239–248.

[131] W. Wong, C. Koh, Y. Chen, H. Li, VOSCH: voltage scaled cache hierarchies, in: 25th International Conference on Computer Design (ICCD), IEEE, 2007, pp. 496–503.

[132] A. Malik, B. Moyer, D. Cermak, A low power unified cache architecture providing power and performance flexibility, in: International Symposium on Low Power Electronics and Design (ISLPED), 2000, pp. 241–243.

[133] G. Gerosa, S. Curtis, M. D'Addeo, B. Jiang, B. Kuttanna, F. Merchant, B. Patel, M. Taufique, H. Samarchi, A sub-1W to 2W low-power IA processor for mobile internet devices and ultra-mobile PCs in 45 nm hi-κ metal gate CMOS, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2008, pp. 256–611.

[134] J. Chang, M. Huang, J. Shoemaker, J. Benoit, S. Chen, W. Chen, S. Chiu, R. Ganesan, G. Leong, V. Lukka, et al., The 65-nm 16-MB shared on-die L3 cache for the dual-core Intel Xeon processor 7100 series, IEEE Journal of Solid-State Circuits 42 (4) (2007) 846–852.

[135] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, M. Ratta, S. Kottapalli, S. Vora, A 45 nm 8-core enterprise Xeon processor, IEEE Journal of Solid-State Circuits 45 (1) (2010) 7–14.

[136] N. Kurd, S. Bhamidipati, C. Mozak, J. Miller, T. Wilson, M. Nemani, M. Chowdhury, Westmere: a family of 32 nm IA processors, in: IEEE International Solid-State Circuits Conference, Digest of Technical Papers (ISSCC), 2010, pp. 96–97.

[137] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, A. Kovacs, The implementation of the 65 nm dual-core 64b Merom processor, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2007, pp. 106–590.

[138] P. Royannez, H. Mair, F. Dahan, M. Wagner, M. Streeter, L. Bouetel, J. Blasquez, H. Clasen, G. Semino, J. Dong, et al., 90 nm low leakage SoC design techniques for wireless applications, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2005, pp. 138–589.

[139] H. Mair, A. Wang, G. Gammie, D. Scott, P. Royannez, S. Gururajarao, M. Chau, R. Lagerquist, L. Ho, M. Basude, et al., A 65-nm mobile multimedia applications processor with an adaptive power management scheme to compensate for variations, in: IEEE Symposium on VLSI Circuits, 2007, pp. 224–225.

[140] G. Gammie, A. Wang, M. Chau, S. Gururajarao, R. Pitts, F. Jumel, S. Engel, P. Royannez, R. Lagerquist, H. Mair, et al., A 45 nm 3.5G baseband-and-multimedia application processor using adaptive body-bias and ultra-low-power techniques, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2008, pp. 258–611.

[141] V. George, S. Jahagirdar, C. Tong, K. Smits, S. Damaraju, S. Siers, V. Naydenov, T. Khondker, S. Sarkar, P. Singh, Penryn: 45-nm next generation Intel® Core 2 processor, in: IEEE Asian Solid-State Circuits Conference (ASSCC), 2007, pp. 14–17.

[142] A. Branover, D. Foley, M. Steinman, AMD Fusion APU: Llano, IEEE Micro 32 (2) (2012) 28–37.

[143] K. Zhang, U. Bhattacharya, Z. Chen, F. Hamzaoglu, D. Murray, N. Vallepalli, Y. Wang, B. Zheng, M. Bohr, SRAM design on 65-nm CMOS technology with dynamic sleep transistor for leakage reduction, IEEE Journal of Solid-State Circuits 40 (4) (2005) 895–901.

[144] Y. Wang, U. Bhattacharya, F. Hamzaoglu, P. Kolar, Y. Ng, L. Wei, Y. Zhang, K. Zhang, M. Bohr, A 4.0 GHz 291 Mb voltage-scalable SRAM design in a 32 nm high-k + metal-gate CMOS technology with integrated power management, IEEE Journal of Solid-State Circuits 45 (1) (2010) 103–110.

[145] F. Hamzaoglu, K. Zhang, Y. Wang, H. Ahn, U. Bhattacharya, Z. Chen, Y. Ng, A. Pavlov, K. Smits, M. Bohr, A 3.8 GHz 153 Mb SRAM design with dynamic stability enhancement and leakage reduction in 45 nm high-k metal gate CMOS technology, IEEE Journal of Solid-State Circuits 44 (1) (2009) 148–154.

[146] Y. Wang, H. Ahn, U. Bhattacharya, T. Coan, F. Hamzaoglu, W. Hafez, C. Jan, R. Kolar, S. Kulkarni, J. Lin, et al., A 1.1 GHz 12 μA/Mb-leakage SRAM design in 65 nm ultra-low-power CMOS with integrated leakage reduction for mobile applications, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2007, pp. 324–606.

[147] M. Khellah, N. Kim, J. Howard, G. Ruhl, Y. Ye, J. Tschanz, D. Somasekhar, N. Borkar, F. Hamzaoglu, G. Pandya, et al., A 4.2 GHz 0.3 mm² 256 kb dual-Vcc SRAM building block in 65 nm CMOS, in: IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2006, pp. 2572–2581.

[148] S. Kim, J. Lee, Write buffer-oriented energy reduction in the L1 data cache of two-level caches for the embedded system, in: 20th Great Lakes Symposium on VLSI, ACM, 2010, pp. 257–262.

[149] N. Kim, K. Flautner, D. Blaauw, T. Mudge, Drowsy instruction caches: leakage power reduction using dynamic voltage scaling and cache sub-bank prediction, in: 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2002, pp. 219–230.

[150] Z. Xie, D. Tong, X. Cheng, WHOLE: a low energy I-cache with separate way history, in: IEEE International Conference on Computer Design, 2009, pp. 137–143.

Sparsh Mittal received the B.Tech. degree in electronics and communications engineering from IIT Roorkee, India, and the Ph.D. degree in computer engineering from Iowa State University, Ames, IA, USA. He is currently working as a Post-Doctoral Research Associate at Oak Ridge National Laboratory (ORNL), TN, USA. In his B.Tech. degree, he was the graduating topper of the batch and his major project was awarded the institute silver medal. He was awarded scholarship and fellowship from IIT Roorkee and Iowa State University. His research interests include non-volatile memory, memory system power efficiency, cache architectures in multicore systems, and GPU architecture.