
Limits on Voltage Scaling for Caches Utilizing Fault Tolerant Techniques

Mohammad A. Makhzan, Amin Khajeh, Ahmed Eltawil, and Fadi Kurdahi
University of California, Irvine, USA

{mmakhzan,akhajehd,aeltawil,kurdahi}@uci.edu

Abstract

This paper proposes a new low power cache architecture that utilizes fault tolerance to allow aggressively reduced voltage levels. The fault tolerant overhead circuits consume little energy, yet enable the system to operate correctly and to deliver performance close to that of defect-free operation. Overall, power savings of over 40% are reported on standard benchmarks.

1. Introduction

In this paper we introduce a novel solution for reducing power consumption in high speed processor caches at advanced process technologies. We explore the effect of reducing the supply voltage in L1 caches while keeping the same performance level. We demonstrate that by keeping the cell access time constant while reducing the cache voltage, the probability of failure per memory cell increases. We then introduce a new fault tolerant architecture to address the increased error rates. This architecture guarantees the correct operation of the cache and at the same time presents a virtually defect-free view of the cache to the processor. The proposed scheme is readily extensible to memory hierarchies including L2 and L3 caches and could also be used to deal with manufacturing defects. In prior work, fault tolerance has been approached purely from a manufacturing perspective to allow highly reliable circuits. The converse of this approach, trading off power consumption by allowing and correcting a small number of faults, has not been studied in detail. By lowering the voltage in the cache, power consumption is reduced, but as a side effect of process variation the number of defective cells increases. To overcome this problem we add extra hardware to the cache that reduces the miss rate caused by defective blocks. Our primary focus is therefore on power consumption: even after adding the extra fault tolerant hardware, our solution should still consume less energy than a traditional cache architecture.

2. Prior work

A cache is a high speed SRAM memory used to bridge the speed gap between the processor and slower Lower Level Memories (LLM). Accessing the LLM incurs a longer access time and large power consumption. By reducing the number of accesses to the LLM, a cache decreases the average access time and improves the overall dynamic power consumption. It is the temporal and spatial locality of instructions and data that allows the architect to use caches for holding data and instructions closer to the CPU. If a cache is defective, the system can still operate correctly provided that the faulty cache block or word can be disabled. One way to achieve this is by marking a defective cache word/block with an extra bit added to the set of flag bits of that block, provided that the added bit is not itself defective. In prior work, this bit was called the Fault Tolerance bit (FT-bit) [2]. The set of FT-bits for memory words or blocks is called the defect map. This defect map is used to turn a cache line off when it is faulty. Turning a cache line off in an associative cache reduces the degree of associativity by one. In a direct mapped cache, on the other hand, every access to a disabled line will result in a miss. This scheme of disabling faulty blocks is used even when the cache is protected by single-error correcting, double-error detecting (SEC-DED) codes [4]. Replacement techniques, such as extra rows and columns, are also used in caches for yield enhancement [5] and for tolerating lifetime failures [6][7][8][9]. With replacement techniques, there is no performance loss in caches with faults; however, the number of extra resources limits the number of faults that can be tolerated. Another form of redundancy is the use of extra bits per word to store an error correcting code [6]. Sohi investigated the application of a Single Error Correcting and Double Error Detecting (SEC-DED) Hamming code in an on-chip cache memory and found that it degrades the overall memory access time significantly.

The work in [10] suggested resizable caches. In this technique it is assumed that in a cache layout, two or more blocks are laid in one row, therefore the column decoders are altered to choose another block in the same row if the original block is defective. In a technique called PADded caches [2] the decoder is modified to allow remapping of the defective blocks to other locations of the cache without destroying the temporal locality of data or instruction.

In this paper, a new architecture is presented that specifically addresses power and yield improvement via fault tolerance. The ability to tolerate faults induced by process variation and operating conditions allows aggressive voltage reduction, which in turn leads to reduced power consumption.

3. Effect of voltage scaling on memory

Classically, failures in embedded memory cells are categorized as either of a transient nature, dependent on operating conditions, or of a fixed nature due to manufacturing errors. Symptoms of these failures are expressed as either: (1) an increase in cell access time, or (2) unstable read/write operations.

In process technologies above 100nm, fixed errors are predominant, with a minority of the errors introduced by transient effects. This model cannot be sustained as scaling progresses due to the random nature of dopant atom distributions and variation in gate length. In fact, in sub-100nm designs, Random Dopant Fluctuation (RDF) has a dominant impact on transistor strength mismatch and is the most noticeable type of intra-die variation that can lead to cell instability and failure in embedded memories [16]. RDF has a detrimental effect on transistors that are co-located within one cell by creating a mismatch in their intrinsic threshold voltages, Vth. Furthermore, these effects are a strong function of the operating conditions (voltage, frequency, temperature, etc.). To model the effects of RDF on the probability of bit failures within a memory array, a simulation was set up in which RDF effects are lumped into an independent Gaussian distribution characterizing the Vth fluctuations of each transistor [17]. The circuit under test is a standard six transistor SRAM memory bit cell. The SPICE models used for the simulation were obtained from the Predictive Technology Model (PTM) website [13] for 65nm. In the simulations, Vdd was lowered from the nominal Vdd0 = 0.9 V [18] to 0.6 V, and the cell failure statistics were calculated assuming a constant access time. The total cell failure probability, including read, write and destructive read failures, for different Vdd is shown in Figure 1 [19]. Similar curves were obtained for other technologies, such as 45nm and 32nm.
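To make the lumped-RDF model concrete, the following Python sketch (not the SPICE flow used in the paper) shows how a Monte Carlo estimate of the per-cell failure probability could be organized; the Vth mean, sigma, and the mismatch-based failure criterion are placeholder assumptions, not values from the paper.

import random

# Illustrative Monte Carlo sketch: estimate the probability that a 6T SRAM
# cell fails at a given Vdd when each transistor's threshold voltage is drawn
# from an independent Gaussian (lumped RDF model).
SIGMA_VTH = 0.03      # assumed Vth standard deviation in volts (placeholder)
VTH_NOM   = 0.35      # assumed nominal Vth in volts (placeholder)

def cell_fails(vdd, vth_samples):
    # Placeholder failure criterion: the cell is counted as failing when the
    # mismatch between its two pull-down devices exceeds a Vdd-dependent margin.
    margin = 0.25 * (vdd - VTH_NOM)
    return abs(vth_samples[0] - vth_samples[1]) > margin

def p_fail(vdd, trials=100_000):
    fails = 0
    for _ in range(trials):
        vths = [random.gauss(VTH_NOM, SIGMA_VTH) for _ in range(6)]
        if cell_fails(vdd, vths):
            fails += 1
    return fails / trials

for vdd in (0.9, 0.8, 0.7, 0.6):
    print(f"Vdd = {vdd:.1f} V  ->  P(e) ~ {p_fail(vdd):.2e}")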

[Figure 1: Probability of error P(e) versus Vdd (V) for a 65 nm 6T SRAM cell]

4. Effect of Vdd scaling on memory logic

In this section, as a case study, we focus on a three stage pipelined cache [1], but note that this technique can be applied to wave pipelined and non-pipelined caches as well. In a three stage pipelined cache, the word-line to sense-amp path via the bit-line is the longest pipeline stage. This stage is analog and cannot be pipelined into further stages [1]. Figure 2 illustrates the timing of a three stage pipelined cache working at nominal Vdd. When pipelined, every stage in the cache has one cycle to complete its operation. The cycle length is determined by the delay of the longest stage in the pipeline. Table 1 provides the timing for a 16KB cache obtained from CACTI 4.2 [11]. The decoding and output driver stage delays are smaller than the word-line to sense-amp delay. Therefore, when pipelined, the decoding and output stages finish earlier than their allotted cycle time.

Figure 2: Converting the access time of a non pipelined cache to a pipelined cache

Reducing the voltage on the logic (decoder and output driver) increases its delay. The delay of the decoder and output stage logic is bounded by the cycle time. The reduction of the voltage is also bounded by the minimum voltage that the logic needs to operate as a switching device. As long as we operate with a margin above the threshold voltage, we do not need to be concerned with failure of the logic. The delay of the logic [12] is related to the voltage by the equation

Delay = V_dd / (V_dd - V_th)^1.3    (1)


Table 1 shows the CACTI estimates of the various stage delays in a 16KB cache whose parameters are described in the same table. The Table 1 data includes the delay of the decoder, memory cell, and output logic for both the tag and data arrays. In this paper, we assume that all components are controlled by a single variable supply voltage; this assumption is made for simplicity and ease of implementation. The latches are an exception and are assumed to always operate at the nominal cache Vdd. The first and third stages in the three stage pipeline have a lower limit on how far their voltage can be reduced before the stage delay exceeds the second stage delay (which defines the cache cycle time). In the proposed cache, when reducing the voltage, the tag decoding stage delay reaches the cycle time limit before the other logic stages; since all stages are connected to a single supply voltage, this stage defines the lower limit of Vdd reduction (a numeric sketch after Table 1 illustrates this limit).

                         Data stage delays (ns)    Tag stage delays (ns)
Stage 1:  decode         0.13958                   0.1709
          total          0.13958                   0.1709
Stage 2:  word-line      0.12155                   0.0778
          bit-line       0.10542                   0.05
          sense amp      0.07263                   0.0446
          total          0.29959                   0.1724
Stage 3:  output driver  0.1031                    compare 0.1088
                                                   valid signal 0.0344
          total          0.1031                    0.1432

Cache parameters: number of banks 1; block size 16 bytes; total cache size 16386 bytes; read ports per bank 1; number of sets per bank 1024; write ports per bank 1; associativity 1; technology size 70 nm.

Table 1: Stage delays and cache parameters for a 16KB cache
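As a rough numeric illustration of this lower bound, the sketch below scales the Table 1 tag decode delay with Vdd using the delay relation of Equation (1), normalized to the nominal supply; the assumed Vth is a placeholder and the result is only indicative, not the paper's CACTI-based analysis.

# Illustrative sketch: find the lowest Vdd at which the tag decode stage still
# fits within the cycle time set by the data word-line-to-sense-amp stage.
# Vth is an assumed value, not one reported in the paper.
VTH, ALPHA = 0.35, 1.3          # assumed threshold voltage (V) and delay exponent
VDD0 = 0.9                      # nominal supply voltage (V)
CYCLE_TIME = 0.29959            # ns, longest (analog) stage at nominal Vdd
TAG_DECODE_AT_VDD0 = 0.1709     # ns, from Table 1

def scaled_delay(d0, vdd):
    # Delay ~ Vdd / (Vdd - Vth)^1.3 from Equation (1), normalized to nominal Vdd.
    nominal = VDD0 / (VDD0 - VTH) ** ALPHA
    return d0 * (vdd / (vdd - VTH) ** ALPHA) / nominal

vdd = VDD0
while scaled_delay(TAG_DECODE_AT_VDD0, vdd - 0.01) <= CYCLE_TIME:
    vdd -= 0.01
print(f"Tag decode reaches the cycle time near Vdd = {vdd:.2f} V")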

In the next section we introduce the proposed fault tolerant solution that preserves the power savings (due to Vdd scaling) by reducing the miss rate.

5. Proposed architecture

Figure 3 illustrates our proposed architecture. In addition to the CACHE block (i.e., the traditional cache), two other blocks are added: the Bit Lock Block (BLB) and the Inquisitive Defect Cache (IDC). The BLB, CACHE, and IDC receive the same address from the CPU. The BLB is an off-cache defect map and the IDC is a small cache that supports fault tolerant operation. The following sections describe the functionality of each of these components.

Figure 3: Proposed architecture

5.1 Bit Lock Block (BLB) defect map

To create a defect map we use one bit for each word in the cache. If the defect bit of each word were kept inside the tag bits of its associated cache line (the traditional approach), it would be prone to the same probability of failure as any other bit in the cache. To ensure correctness, we place the defect map outside the cache and always operate it at its nominal supply voltage. A word based defect map requires more area/energy than a defect map with one bit per block; however, this finer granularity makes a significantly larger portion of the defective cache usable. Depending on how it is updated, the BLB may be used for storing temperature induced defects as well as process variation defects caused by random dopant fluctuation or gate length variation. After a defective cell is identified, the word containing that cell is marked as defective. Experimental results from physical measurements in [20] show that strong spatial correlation exists in intra-die gate length variation, so gate length variation tends to cause failures in a few neighboring cells. Since our defect map marks the entire word as defective, gate length variation defects are covered. Taking advantage of the temporal and spatial locality of accesses to the cache, we have added a buffer that acts as a small cache for the BLB. Access to the BLB is serialized via the BLB buffer. The BLB buffer improves the system power/performance metrics by (1) reducing dynamic power consumption, since it reduces the number of accesses to the BLB, and (2) reducing the time needed to obtain BLB information, since it sits closer to the CPU. Each time the BLB is accessed, the buffer is updated with the TAG of the accessed instruction shifted to the right by Log2(Buffer_size), together with the spatially adjacent defect map bits that can be fetched in one BLB access. Temporal and spatial locality of accesses in the program increases the likelihood that the BLB buffer contains the defect information corresponding to the next cache access, removing the need to access the BLB.
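A minimal sketch of the BLB buffer lookup described above; the single-row buffer organization, field widths, and names are illustrative assumptions rather than the paper's implementation.

import math

# Illustrative BLB buffer sketch: a buffer caches one row of the defect map.
# The buffer tag is the address shifted right by log2(buffer size); on a hit,
# the defect bit is read from the buffer instead of the BLB array.
class BLBBuffer:
    def __init__(self, blb_bits, buffer_size=64):
        self.blb = blb_bits                 # full defect map, one bit per word
        self.size = buffer_size
        self.shift = int(math.log2(buffer_size))
        self.tag = None                     # which BLB row is currently buffered
        self.row = [0] * buffer_size

    def is_defective(self, word_addr):
        tag = word_addr >> self.shift
        if tag != self.tag:                 # buffer miss: fetch one BLB row
            self.tag = tag
            base = tag << self.shift
            self.row = self.blb[base:base + self.size]
        return bool(self.row[word_addr & (self.size - 1)])

# Usage: a 1024-word cache with two words marked defective.
blb = [0] * 1024
blb[37] = blb[38] = 1
buf = BLBBuffer(blb)
print(buf.is_defective(37), buf.is_defective(40))   # True False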


5.2 Updating the BLB

Initially, the BLB needs to be populated with the chip-specific defect locations. This can be accomplished after manufacturing, or at power up using a Built-In Self-Test (BIST) circuit. Updating the BLB at manufacturing time is an expensive solution, since testing for multiple error vectors significantly adds to the final chip cost. This analysis favors boot time defect detection for generating the BLB using the BIST infrastructure. In addition, we still need to consider temperature induced errors and update the defect map dynamically. This is achieved in one, or a combination, of the following ways.

The first solution is the use of parity bits. Every time a new error is discovered, its word location is registered in a very small memory (in our case 16 or 32 entries). Since the newly discovered defect could be a soft error, the BLB is not updated immediately; if the defective word is detected again, it is considered a temperature induced error and is registered in the BLB. Each BLB row has a dirty bit which is set if a bit in that row is updated. The presence of the dirty bit requires a write back of the BLB information when changing the voltage level. The second approach is to run the built-in BIST engine periodically. A third approach, already adopted by many manufacturers in 65nm technologies and below, is the use of a Dynamic Temperature Sensing (DTS) infrastructure. The DTS is geared towards avoiding overheating but can also be used for other purposes such as modifying the policy stacks. In this approach, the system senses the cache temperature; when a temperature increase exceeding a pre-specified amount is detected, the BIST engine is triggered to test the heated locations for temperature induced defects. After detection of any defects, the BLB is updated and the dirty bit is set. Since temperature is a very slowly changing variable, such BIST updates will not run frequently. In addition, as the chip ages the BLB coverage increases and fewer BLB updates are required. Finally, another approach, which can be combined with the previous methods, is an adaptive technique that decides on the voltage level by monitoring the number of dynamically generated defects. In this adaptive approach the voltage is increased to mitigate/mask a high rate of dynamically generated errors. Every time the voltage changes, the incremental defect map is loaded into the BLB and the BLB buffer is flushed; if there are dirty bits, a write back is needed before loading the new defect map.

The BLB itself is very small compared to the cache and can therefore operate much faster than the latter. Thus, the BLB buffer access and BLB decoding are both done in one cache clock cycle. The second and third stages of a BLB access can be performed in the next cycle without exceeding the cache cycle time. Furthermore, the BLB voltage is not scaled and therefore its delay never exceeds the cache cycle time.

5.3 Inquisitive Defect Cache memory

The Inquisitive Defect Cache is a small cache which acts as a placeholder for defective cache words. The IDC can be much smaller than the total size of all the defective words in the cache. If the IDC size and associativity are chosen properly, the execution window of a process (after its first pass through the cache) should encounter very few defective/disabled words. Due to space limitations we only present the direct mapped IDC in this paper.

Figure 4: In the first pass defects in window of execution are mapped to IDC.

If a word is defective, the data for that word can only be found in the IDC; every read/write access to a defective word is mapped to the IDC. Based on our simulations, the BLB buffer tag comparison and bit masking take less time than the IDC data and tag decoding (by 58%) and the cache tag and data decoding (by 41%). Therefore, if the BLB information is found in the buffer, the second and third pipeline stages in either the cache or the IDC can be avoided by inserting bubbles in the pipeline. Even if the needed information is not in the BLB buffer, it can still be retrieved from the BLB before the third stage of either the IDC or the cache access happens. In this case, the output (3rd) stage in either the cache or the IDC can be gated.
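The steering decision described in this section can be summarized by the following sketch; the data structures and the way pipeline gating is reported are simplified stand-ins, not the paper's hardware.

# Illustrative steering logic: the BLB (or its buffer) decides, per word access,
# whether the main cache or the IDC supplies the data, and which block's later
# pipeline stages can be gated.
def steer_access(word_addr, blb_is_defective, idc, cache):
    """Return (data, gating decision) for one read access."""
    if blb_is_defective(word_addr):
        data = idc.get(word_addr)          # defective word lives only in the IDC
        return data, "cache stages 2-3 gated"
    data = cache.get(word_addr)
    return data, "IDC stages 2-3 gated"

# Usage with toy dictionaries standing in for the IDC and the main cache array.
idc   = {37: "patched word 37"}
cache = {40: "word 40", 37: "unreliable"}   # word 37 is marked defective
blb   = lambda a: a == 37

print(steer_access(37, blb, idc, cache))    # ('patched word 37', 'cache stages 2-3 gated')
print(steer_access(40, blb, idc, cache))    # ('word 40', 'IDC stages 2-3 gated')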

6. Energy models

6.1. Cache access scenarios

Table 3 explains the terms used in this section. Based on the discussion in the previous section, one of the following 4 cases could happen when accessing the proposed cache architecture:


1) The BLB buffer hits and points to the cache. In this case, the 2nd and 3rd pipeline stages in the IDC can be gated and disabled. This case is shown in Figure 5.a; the green stages are the accessed stages and the white stages are the gated ones. Dynamic power consumption is estimated as follows:

E = (E_buff + E_Didc + E_Rc) × (V_dd² / V_dd0²)

2) The BLB buffer contains the required defect map data and points to the IDC (Figure 5.b):

E = (E_buff + E_Ridc + E_Dc) × (V_dd² / V_dd0²)

3) The BLB points to the cache after a BLB buffer miss (Figure 6.a):

E = (E_buff + E_blb + E_Rmidc + E_Rc) × (V_dd² / V_dd0²)

4) The BLB points to the IDC after a BLB buffer miss (Figure 6.b):

E = (E_buff + E_blb + E_Ridc + E_Rmc) × (V_dd² / V_dd0²)

Figure 5: a. (left) BLB buffer points to the cache; W to B stands for word-line to sense-amp. b. (right) BLB buffer points to the IDC

Figure 6: a. (left) BLB points to the cache. b. (right) BLB points to the IDC

IDC size: 64 words
BLB buffer size: 64 bits wide
Voltage simulation range: nominal down to 35% lower, in 0.05 V increments
32nm cycle time: 95.9 ps (75.5 ps word-line to sense-amp, 20.4 ps sense amp)
65nm cycle time: 326.3 ps (250.2 ps word-line to sense-amp, 76.1 ps sense amp)
Cache size: 16KB (refer to Table 1)
Benchmarks:
  DEC  - a graphical decoder
  LISP - part of a game written in LISP
  MACR - a graphics processing program
  PASC - an accounting program written in Pascal
  MUL  - matrix-matrix multiplication
  SPIC - a SPICE simulation

Table 2: Simulation configuration

Due to the temporal and spatial locality of accesses, cases 1 and 2 are the dominant access types, since most of the time the defect map information can be found in the BLB buffer. Since these cases consume the least dynamic power, we expect a reasonable reduction in dynamic power consumption. Although larger BLB buffers consume more dynamic energy, they increase the number of occurrences of cases 1 and 2; the optimal BLB buffer size can be determined from this trade-off through simulation (a small sweep sketch follows). The proposed architecture was simulated for 32 and 65 nm technologies. We used a trace driven pipelined cache simulator (a modified Dinero IV) to validate our design. Six benchmarks (each simulated for 1B instructions) are used in our simulations; Table 2 illustrates the simulation setup. For each simulation point a probability of failure per memory cell is obtained. Using that probability of failure, we created several defective caches and simulated the behavior of the proposed architecture. In order to evaluate the architecture, one should consider both dynamic and leakage power/energy consumption. In the following, we develop models for both components.
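The BLB buffer size trade-off mentioned above can be explored with a sketch like the following, which sweeps the buffer size against the buffer hit rate on a synthetic, mildly local address trace; the trace generator and sizes are placeholders, not the paper's benchmarks.

import math, random

# Illustrative sweep: larger buffers raise the hit rate (cases 1 and 2) at the
# cost of more energy per buffer access.
def buffer_hit_rate(trace, buffer_size):
    shift = int(math.log2(buffer_size))
    cached_tag, hits = None, 0
    for addr in trace:
        tag = addr >> shift
        if tag == cached_tag:
            hits += 1
        else:
            cached_tag = tag           # the first access to each row is a miss
    return hits / len(trace)

random.seed(0)
trace, addr = [], 0
for _ in range(20_000):
    addr = (addr + random.choice([1, 1, 1, 2, -1, 64])) % 4096   # mostly sequential
    trace.append(addr)

for size in (16, 32, 64, 128):
    print(f"buffer size {size:4d} words -> hit rate {buffer_hit_rate(trace, size):.3f}")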

P_buff = probability that the defect map entry is in the buffer
E_blb = energy per BLB read
E_buff = energy for decoding the buffer tag and masking the defect bit
E_mem = energy of an access to the lower level memories
E_Rc = energy consumed to read the cache
E_Wc = energy of a write to the cache
E_Ridc = energy to read the IDC
E_Widc = energy to write to the IDC
E_Rmc = energy to read the cache on a miss (no output driver activation)
E_Rmidc = energy to read the IDC on a miss (no output driver activation)
E_Dc = energy to decode the cache
E_Didc = energy to decode the IDC
AorB(A, B) = A if case A happens, B if case B happens
V_dd = scaled memory supply voltage
V_dd0 = nominal memory supply voltage

Table 3: Terms and variables used

6.2. Dynamic cache energy model

Dynamic power is a function of the miss rate. As the miss rate increases, the dynamic power consumption increases because of the additional accesses to the next level cache and main memory. Consider a cache access scenario that results in a miss. The address is sent to the BLB buffer, the IDC and the cache, and all three components start decoding the address at the same time. If the address is not found in the BLB buffer, the BLB needs to be accessed. Following one of the four cases explained above, a miss will occur, and the lower level memory is then accessed for the data. When the data is ready in the lower level memory, it is transferred to the upper level by writing to the IDC and/or the cache. When the data has been updated in the upper level memory, an interrupt is sent to the processor and the cache request is reinitiated. At this point the BLB, IDC and cache are accessed again in parallel. As with the initial cache access, one of the four access scenarios explained above will now find the data either in the cache or in the IDC. We define the total energy per miss as:

E_miss = AorB(E_buff, E_buff + E_blb)
       + AorB(E_Rmc + E_Didc, E_Rmidc + E_Dc) × (V_dd² / V_dd0²)
       + E_mem
       + AorB(E_buff, E_buff + E_blb)
       + AorB(E_Wc, E_Widc) × (V_dd² / V_dd0²)
       + AorB(E_Rc + E_Didc, E_Ridc + E_Dc) × (V_dd² / V_dd0²)
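The per-miss energy model above can be expressed as straightforward bookkeeping; the sketch below uses placeholder energy constants (the paper obtains the real values from CACTI and SPICE) and a simple aorb helper for the AorB selector defined in Table 3.

# Illustrative bookkeeping for the per-miss energy model; energy values are
# placeholders in arbitrary units, and the same buffer/defect status is reused
# for the initial access and the re-access for simplicity.
E = {
    "buff": 1, "blb": 3, "mem": 400,
    "Rc": 50, "Wc": 55, "Rmc": 35, "Dc": 10,
    "Ridc": 8, "Widc": 9, "Rmidc": 6, "Didc": 2,
}

def scale(vdd, vdd0=0.9):
    """Dynamic energy scaling factor for the voltage-scaled array."""
    return (vdd / vdd0) ** 2

def aorb(cond, a, b):
    """AorB(A, B): A if case A holds, otherwise B."""
    return a if cond else b

def miss_energy(vdd, in_buffer, defective):
    s = scale(vdd)
    lookup = aorb(in_buffer, E["buff"], E["buff"] + E["blb"])
    first  = aorb(not defective, E["Rmc"] + E["Didc"], E["Rmidc"] + E["Dc"]) * s
    fill   = aorb(not defective, E["Wc"], E["Widc"]) * s
    refill = aorb(not defective, E["Rc"] + E["Didc"], E["Ridc"] + E["Dc"]) * s
    return lookup + first + E["mem"] + lookup + fill + refill

print(miss_energy(0.7, in_buffer=True,  defective=False))
print(miss_energy(0.7, in_buffer=False, defective=True))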

The AorB function dictates that only one of its arguments contributes to the energy equation. For example, AorB(E_buff, E_buff + E_blb) indicates that if the defect map entry is in the buffer, the access energy is just that of accessing the buffer, E_buff; if it is in the BLB, the buffer and BLB access energies are added together to obtain the total energy consumption.

6.3. Leakage power analysis

Leakage power does not depend on the miss rate; it does, however, depend on the memory supply voltage. In process technologies of 65nm and below, the dominant leakage component is sub-threshold leakage. The sub-threshold leakage of a MOSFET transistor relates to voltage by:

I_leakage = μ0 · C_ox · (W/L) · v_t² · e^(b·(V_dd − V_dd0)) · (1 − e^(−V_ds / v_t)) · e^((−|V_th| − ΔV) / (n · v_t))    (2)

In this equation, v_t = kT/q is the thermal voltage, W is the device width, V_ds is the drain to source voltage, n is the DIBL coefficient, and μ0 and C_ox are constants that do not depend on temperature or voltage. The constant b is a technology dependent coefficient which is obtained empirically; CACTI 4.2 [11] assumes b = 1.7 in 65 nm technology. We ran SPICE simulations of a 6T memory cell using a W/L ratio of 2 and choosing the widths of the pre-charge and write NMOS transistors as 16X the memory cell NMOS width. Linear regression on the simulation results gave us an estimate of 2.2 for b in 65 nm technology and 3.6 in 32nm technology. In this equation everything is constant except W, V_dd, T and V_th, so the equation can be written as:

I_leakage = W · I_l(T, V_dd, V_th)

Using PTM V1.0 [13] and considering the cache to be operating at 360 K, the total leakage of our proposed cache architecture at different voltage levels for 32 and 65 nm technologies is compared to the leakage of a traditional cache architecture in Figure 7.
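A small numeric sketch of how the e^(b·(Vdd − Vdd0)) term drives the leakage trend: only the voltage-dependent exponential of Equation (2) is kept, so all other factors cancel in the ratio to nominal-Vdd leakage; the b values are the empirical estimates quoted above.

import math

# Illustrative sketch: relative sub-threshold leakage versus Vdd, keeping only
# the e^{b(Vdd - Vdd0)} dependence; all other factors are assumed constant.
VDD0 = 0.9
B = {"65nm": 2.2, "32nm": 3.6}   # empirical b estimates quoted in the text

def relative_leakage(vdd, b):
    return math.exp(b * (vdd - VDD0))

for tech, b in B.items():
    for vdd in (0.9, 0.8, 0.7, 0.6):
        print(f"{tech}  Vdd={vdd:.1f} V  leakage/nominal = {relative_leakage(vdd, b):.2f}")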

Figure 7: Percentage saving in the leakage power in proposed architecture compared with the traditional cache architecture

The leakage power includes the extra leakage introduced by the BLB and the IDC. At nominal voltage the leakage saving compared to a traditional cache (with no IDC and BLB) is negative because of the additional leakage overhead of the IDC and BLB. However, as the voltage is scaled down, the savings in cache leakage overcome the extra leakage introduced by the BLB and IDC. Notice that only the voltage in the main cache is scaled, while the BLB and IDC are kept at their nominal voltages. The leakage savings in 32nm grow more rapidly than in the 65nm technology for two reasons: (1) the Vth used for simulation of the cache in 32nm technology is higher than the Vth used for 65nm, and (2) the b factor in Equation (2) is higher in 32nm technology than in 65nm technology.

7. Simulation results

7.1. Simulation setup

Our trace driven simulator (a modified Dinero IV) [15] was augmented with the analysis results obtained in the previous section and enhanced with bookkeeping code to calculate the energy consumption based on the cache miss scenarios described before. To obtain figures for the dynamic energy consumption of the IDC, cache and BLB we used CACTI 4.2 [11]. Due to its very small size, the BLB buffer is more efficient to implement as a set of flip flops rather than as an SRAM; the BLB buffer decoder energy and the BLB buffer energy per read and write were obtained using SPICE simulation. The simulated system configuration is provided in Table 2.

7.2. Benchmark simulation results

Figures 8 and 9 illustrate the percentage energy savings for the 65 and 32 nm technologies. When operating at nominal Vdd, the proposed cache, which contains the additional blocks (IDC, BLB), consumes slightly more dynamic energy than a traditional cache. The savings in dynamic energy/power consumption increase as we reduce the supply voltage, but only up to a certain point: lowering Vdd further increases the number of defective words beyond the tolerance of our proposed architecture, increasing the miss rate and, as a result, the dynamic energy/power consumption. In addition to the voltage level, the particular execution properties of each benchmark affect the extent of its energy/power savings. As stated previously, the IDC is much smaller than the total number of defective words in the cache and therefore cannot hold all the instructions that are mapped to defective words. Instead, the IDC only keeps those defective words that fall inside the window of execution. When the window of execution changes, the IDC must be updated with the instructions mapped to defective words in the new window of execution. Based on this discussion, programs whose window of execution changes rapidly (many jumps and branches, and generally less locality), or whose window is very large (such that the number of defective words inside it exceeds the IDC size), cannot fully benefit from the proposed architecture; such programs show lower energy savings. The LISP benchmark in Figures 8 and 9 represents a program with many small but long-running loops, a small execution window and good locality. The MACR benchmark, on the other hand, represents a program with less locality and a rapidly changing execution window. The execution properties of the other benchmarks fall between LISP and MACR. As explained before, the logic also imposes a lower bound on Vdd scaling; the overall lower bound on Vdd is therefore the maximum of the decoder minimum voltage (DVdd), the output stage logic minimum voltage (OVdd), and the memory supply minimum voltage (MVdd).

The above analysis applies only to dynamic energy. To get a more realistic estimate of power savings, we need to combine the leakage and dynamic power/energy consumption, from which we can project the total energy savings. Due to lack of space, and noting that the other benchmarks follow the same trend, we chose one representative benchmark (PASC) for further analysis. Figure 10 illustrates the contribution of the dynamic and leakage energies to the total percentage of energy savings for the 32 and 65 nm technologies. Although in 65 nm technology the leakage energy consumption is much lower than the dynamic energy, both leakage and dynamic energy decrease at almost the same rate as the voltage is reduced. In 32nm technology the opposite holds: leakage is the dominant energy component and it drops faster with voltage, so the total energy savings are dominated by the leakage. Overall, the energy savings for the PASC benchmark are 11% in 65nm and over 45% in 32nm.

Figure 8: Dynamic energy saving in 65nm

Figure 9: Dynamic energy saving in 32nm

A traditional way of dealing with faulty blocks is to disable them if even one bit in the block is defective. We have compared the miss rate resulting from our architecture with that of such a traditional architecture across different voltages in 32nm technology; Table 4 summarizes this comparison for the PASC benchmark. When the voltage is reduced to 0.75V, the traditional architecture suffers a 17.5% miss rate while our architecture suffers only a 3.92% miss rate. Note that a 3.9% miss rate is due to CCC (Compulsory, Capacity and Conflict) misses; therefore in our case the miss rate has increased by only 0.02%, whereas the traditional system miss rate has increased by 13.6%. Table 5 summarizes the percentage reduction in total energy consumption for each benchmark in 65nm and 32nm technology. Note that, due to the predominance of leakage energy in 32nm, almost 3-4 times more energy savings are realized compared to 65nm.

[Figure 10: PASC benchmark energy savings; percentage saving in energy versus voltage (1.0 V down to 0.75 V) for Dynamic E 65nm, Leakage E 65nm, Dynamic E 32nm and Leakage E 32nm]

Voltage (V)   Proposed architecture   Traditional
0.7           6.45%                   59.5%
0.75          3.92%                   17.5%
0.8           3.90%                   4.10%
0.85          3.90%                   3.90%
0.9           3.90%                   3.90%

Table 4: Miss rate of the proposed architecture compared to the traditional block disabling architecture in 32nm

                    32nm                          65nm
Benchmark     Min Voltage   Max Saving      Min Voltage   Max Saving
DEC           0.75          43.28%          0.84          10.23%
LISP          0.72          49.38%          0.77          18.36%
MACR          0.75          44.32%          0.83          12.21%
PASC          0.74          45.60%          0.83          11.90%
MUL           0.75          43.12%          0.84          7.93%
SPIC          0.72          44.73%          0.77          15.21%

Table 5: Maximum power saving and minimum Vdd

7.3. Area overhead analysis

Using area figures from CACTI for the main cache, IDC and BLB, and synthesis results for the BLB buffer, an additional 12.88% of area is needed for the BLB and IDC in 32 nm technology, and 12.32% in 65 nm technology. Noting that our scheme can also be used to remedy manufacturing defects, and therefore removes the need for Built-In Self Repair (BISR) (but not BIST), the actual penalty can be much smaller when compared to memories with BISR.

8. Conclusion

In this paper, we targeted caches for aggressive supply voltage scaling while maintaining the same access time. We illustrated how power consumption is significantly reduced, while access faults occurring at some memory locations reduce the effective cache size. We showed that, even so, the overall power can be significantly reduced. The proposed approach can be extended in many directions, such as considering whole memory hierarchies. The impact of this technique becomes more significant as process geometries shrink.

9. Acknowledgement

We gratefully acknowledge and appreciate the financial support offered by the Center of Pervasive Communication and Computing (CPCC) and Samsung Corp. in driving this research forward.

10. References

[1] A. Agarwal et al., "Exploring High Bandwidth Cache Architecture for Scaled Technology," Proc. DATE '03.
[2] P. P. Shirvani and E. J. McCluskey, "PADded Cache: A New Fault-Tolerance Technique for Cache Memories," Proc. 17th IEEE VLSI Test Symposium, pp. 440-445, April 1999.
[3] T. Ishihara and F. Fallah, "A Cache Defect Aware Code Placement Algorithm for Improving the Performance of Processors."
[4] H. T. Vergos and D. Nikolos, "Efficient Fault Tolerant Cache Memory Design," Microprocessing and Microprogramming Journal, vol. 41, no. 2, pp. 153-169, 1995.
[5] M. A. Lucente et al., "Memory System Reliability Improvement Through Associative Cache Redundancy," Proc. CICC '90, pp. 19.6.1-19.6.4.
[6] G. Sohi, "Cache Memory Organization to Enhance the Yield of High Performance VLSI Processors," IEEE Trans. Computers, vol. 38, no. 4, pp. 484-492, April 1989.
[7] A. Pour and M. Hill, "Performance Implications of Tolerating Cache Faults," IEEE Trans. Computers, vol. 42, no. 3, pp. 257-267, 1993.
[8] X. Luo and J. C. Muzio, "A Fault-Tolerant Multiprocessor Cache Memory," Proc. IEEE Workshop on Memory Technology, Design and Testing, pp. 52-57, August 1994.
[9] Y. Ooi et al., "Fault-Tolerant Architecture in a Cache Memory Control LSI," IEEE JSSC, vol. 27, no. 4, pp. 507-514, April 1992.
[10] A. Agarwal et al., "Process Variation in Embedded Memories: Failure Analysis and Variation Aware Architecture," IEEE JSSC, vol. 40, no. 9, 2005.
[11] http://www.hpl.hp.com/personal/Norman_Jouppi/cacti4.html
[12] B. Zhai et al., "The Limit of Dynamic Voltage Scaling and Insomniac Dynamic Voltage Scaling," IEEE Trans. VLSI, vol. 13, no. 11, 2005.
[13] http://www.eas.asu.edu/~ptm
[14] http://www-device.eecs.berkeley.edu/~bsim3/latenews.html
[15] http://www.cs.wisc.edu/~markhill/DineroIV/
[16] H. Mahmoodi et al., "Modeling of Failure Probability and Statistical Design of SRAM Array for Yield Enhancement in Nano-scaled CMOS," IEEE Trans. CAD, 2003.
[17] A. Bhavnagarwala et al., "The Impact of Intrinsic Device Fluctuations on CMOS SRAM Cell Stability," IEEE JSSC, April 2001.
[18] http://www.itrs.net/Links/2005ITRS/Home2005.htm
[19] A. Djahromi et al., "Cross Layer Error Exploitation for Aggressive Voltage Scaling," Proc. ISQED 2007.
[20] P. Friedberg et al., "Modeling Within-die Spatial Correlation Effects for Process-design Co-optimization," Proc. ISQED 2005.

