[IEEE 2011 IEEE International Symposium on Circuits and Systems (ISCAS) - Rio de Janeiro, Brazil (2011.05.15-2011.05.18)] 2011 IEEE International Symposium of Circuits and Systems


Towards optimal CMOS lifetime via unified reliability modeling and multi-objective optimization

Agathoklis Papadopoulos, Theocharis Theocharides, Maria K. Michael

KIOS Research Center, Department of Electrical & Computer Engineering, University of Cyprus, Nicosia, Cyprus

Abstract— Reliability of CMOS devices emerges as a vital design constraint, evidenced by several CMOS failure mechanisms. Such mechanisms have traditionally been modeled independently, using statistical approximation techniques to estimate Mean-Time-to-Failure (MTTF) rates. This paper proposes a unified framework that integrates the existing failure models into a multi-objective optimization engine, in an attempt to provide a pareto-optimal solution indicating the suggested operating conditions of a system for a given technology and size (in transistors), in an effort to maximize its lifetime reliability. In addition to the existing failure mechanisms, the framework also considers a proposed system-level leakage power estimation model, as leakage is interdependent on temperature, and as such impacts system reliability. The framework can be used in several design scenarios, such as thermal-aware task scheduling.

I. INTRODUCTION

Nanoscale fabrication technologies push CMOS devices to their limits, and as a result designers must take into account physical phenomena that were neglected in earlier technologies. In the nanometer era, the possibility of manufacturing defects is amplified by mechanisms present in the CMOS physical structure. Such mechanisms can accumulate enough damage in a circuit to degrade its lifetime and reliability. Several attempts have been made to model the lifetime and reliability of CMOS devices using statistical estimation techniques. However, the majority of these attempts treat each mechanism independently, developing an individual model for each failure mechanism. Moreover, most of the models have been used to estimate the worst-case scenario [1,2]. While reliability in CMOS circuits has traditionally been assured by assuming worst-case operating conditions, this assumption is rather pessimistic: CMOS circuits typically operate with a significant margin from the operating conditions estimated at design time. A more optimistic approach, which integrates existing statistical MTTF models into a unified optimization framework that returns the pareto-optimal operating conditions maximizing the estimated lifetime of a system, can be very beneficial to designers. Furthermore, an overlooked but significant factor impacting circuit reliability is leakage power; it is critical for accurate lifetime estimation because most lifetime degradation mechanisms are, like leakage, temperature dependent.

This paper proposes a unified lifetime and operational conditions modeling framework based on multi-objective optimization techniques. The framework can be used during early design-time to help designers estimate and optimize the operational conditions and design parameters of a system. The framework utilizes statistical models describing the failure mechanisms of Electromigration, Stress Migration, Gate Oxide Breakdown, Hot Carrier Injection and Thermal Cycling, in order to create a unified lifetime estimation framework. These models are commonly accepted and have been used in several related works [1-5]. The proposed framework can be updated with emerging models as well. Moreover, a simplified system-level version of leakage current estimation and its impact on reliability is derived, so that it can be integrated into the proposed framework. The paper presents some related work in Section II, and a brief summary of the adopted models in Section III. Section IV presents the proposed leakage current estimation model, and Section V presents the optimization framework and some example applications. Conclusions are provided in Section VI.

This work was supported in part by the Cyprus Research Promotion Foundation.

978-1-4244-9474-3/11/$26.00 ©2011 IEEE

II. RELATED WORK

Several works have emerged in recent years that propose the use of statistical reliability models, in an effort to allow systems to operate under their typical operating conditions rather than under the worst-case scenario traditionally adopted by designers. The majority of these works focus on implementing dynamic reliability management (DRM) techniques. The authors of [1] showed that dynamic voltage scaling is an effective response technique for DRM, reducing costs and improving the performance of multicore systems. [2] proposes a DRM mechanism based on a proportional-integral-derivative (PID) controller, and [3] attempts to balance the trade-off between power consumption and reliability in multicore systems. A lifetime-reliability-aware task allocation for multicore systems has been proposed in [4], and the impact of leakage current on electromigration-limited lifetime has been studied in [5]. These works, however, consider only a couple of failure mechanisms concurrently. Moreover, as temperature impacts the lifetime of CMOS devices, and leakage power and temperature are interdependent, leakage power must be taken into consideration when estimating lifetime reliability.

In this work, we merge several failure mechanisms into a unified reliability model, including a simplified leakage power impact model, in an attempt to utilize a multi-objective optimization framework and yield the best possible operating conditions and design parameters. The proposed framework can assist designers in early design stages, by providing pareto-optimal operating conditions for a certain circuit and its technology parameters.

III. PHYSICAL MODELS AND LIFETIME ESTIMATION

A. Lifetime and Reliability Estimation

Reliability is commonly described by the Mean Time to Failure (MTTF), statistically approximated using well-known mathematical functions that stem from each of the failure mechanisms observed in CMOS circuits. In this work, we assume constant failure rates for the failure mechanisms described [12]. While this assumption simplifies the reliability analysis, it is sufficient for evaluating the proposed optimization framework. Assuming that the failure mechanisms compete with each other until the device malfunctions [12], so that the first mechanism to occur causes the failure, the overall reliability function and MTTF can be calculated by:

$$\mathrm{MTTF} = \frac{1}{\sum_{i} \frac{1}{\mathrm{MTTF}_i}} \qquad (1)$$

where $\mathrm{MTTF}_i$ is the MTTF due to the $i$-th failure mechanism.
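The combination step can be illustrated with a minimal Python sketch, assuming constant failure rates so that each mechanism contributes a rate of 1/MTTF_i and the rates simply add. The function name `combined_mttf` and the sample lifetimes are hypothetical and are not taken from the paper.

```python
def combined_mttf(mttfs):
    """Combine per-mechanism MTTFs under the constant-failure-rate,
    competing-mechanisms assumption: the failure rates 1/MTTF_i add up."""
    mttfs = list(mttfs)
    if not mttfs:
        raise ValueError("at least one failure mechanism is required")
    if any(m <= 0 for m in mttfs):
        raise ValueError("MTTF values must be positive")
    total_rate = sum(1.0 / m for m in mttfs)
    return 1.0 / total_rate


# Hypothetical per-mechanism lifetimes in years (illustrative values only).
per_mechanism = {"EM": 30.0, "SM": 60.0, "OBD": 20.0}
overall = combined_mttf(per_mechanism.values())
print(f"overall MTTF: {overall:.2f} years")
```

The combined value is always lower than the shortest individual MTTF, which is why a single dominant mechanism tends to determine the overall lifetime.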

B. Physical Failure Models

Failure mechanisms affecting CMOS reliability vary depending on several factors, and the statistical models describing each one depend on a large number of parameters. We briefly describe each model next; for a more detailed description, we refer the reader to the appropriate references.

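Electromigration (EM) is the transport of material due to momentum transfer between conducting electrons and diffusing metal atoms in a conductor under high direct current density. Damage occurs when enough metal ions are moved away, causing high resistance and eventually open-circuit conditions, thus leading to circuit failure. The accepted model for MTTF due to EM is Black's model [1,2,6,7], which depends on the current density of the line. As current density is related to the switching activity of the line [8], the model can be expressed as:

$$\mathrm{MTTF}_{EM} = A_{EM} \cdot \left( \frac{C \cdot V_{dd}}{W \cdot H} \cdot f \cdot p \right)^{-n} \cdot e^{\frac{E_a}{kT}} \qquad (2)$$

where the bracketed term is the current density of the line, written in terms of its capacitance $C$, width $W$, thickness $H$, the clock frequency $f$ and the switching probability $p$ [1]; $E_a$ is the activation energy, $k$ is Boltzmann's constant and $T$ is the temperature.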
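To make the temperature and current-density sensitivity of Black's model concrete, the sketch below compares two operating points using the textbook form MTTF = A · J^(-n) · exp(Ea/(kT)), in which the constant A cancels out of the ratio. The default exponent `n` and activation energy `ea_ev` are placeholder values chosen for illustration, not parameters reported in the paper.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def em_mttf_ratio(j_ref, t_ref, j_new, t_new, n=2.0, ea_ev=0.9):
    """Return MTTF(new) / MTTF(ref) under Black's equation.

    j_* are current densities (any consistent unit), t_* are temperatures
    in kelvin. n and ea_ev are assumed, technology-dependent parameters.
    """
    current_term = (j_new / j_ref) ** (-n)
    thermal_term = math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_new - 1.0 / t_ref))
    return current_term * thermal_term


# Same current density, 20 K hotter: the EM lifetime shrinks.
print(em_mttf_ratio(j_ref=1.0, t_ref=350.0, j_new=1.0, t_new=370.0))
# Same temperature, doubled current density with n = 2: lifetime is quartered.
print(em_mttf_ratio(j_ref=1.0, t_ref=350.0, j_new=2.0, t_new=350.0))
```

Working with ratios avoids having to calibrate the proportionality constant, which is usually the hardest parameter to obtain at early design time.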

Stress Migration (SM) is a failure mechanism that often occurs in IC metallization. Essentially, it is the movement of metal atoms under the influence of mechanical stress, which occurs when the stress exceeds the yield point of the metal interconnect. This movement can cause voids within the structure of the metal interconnect. Large voids may lead to an open circuit or an unacceptable resistance increase, resulting in circuit failure. The accepted model for MTTF due to thermo-mechanical stress migration follows the Eyring equation [1][7] and is given by:

$$\mathrm{MTTF}_{SM} = A_{SM} \cdot |T_0 - T|^{-n} \cdot e^{\frac{E_a}{kT}} \qquad (3)$$

where $T_0$ is the stress-free (metal deposition) temperature and $T$ is the operating temperature.

Oxide Breakdown refers to the destruction of the oxide layer (usually silicon dioxide) that serves as the dielectric between the gate metal and the semiconductor of a MOS transistor. The strong electric fields across the layer can damage the insulating properties of the oxide, resulting in complete failure of the oxide followed by circuit failure. As proposed in [1], in the case of ultra-thin oxides (<5 nm), the empirically derived model that fits the experimental data is:

$$\mathrm{MTTF}_{OBD} = A_{OBD} \cdot \left( \frac{1}{V} \right)^{a - bT} \cdot e^{\frac{X + \frac{Y}{T} + Z \cdot T}{kT}} \qquad (4)$$

where $V$ is the voltage across the gate oxide, $T$ is the temperature, $k$ is Boltzmann's constant, and $a$, $b$, $X$, $Y$ and $Z$ are fitting parameters [1].

Hot carrier injection (HCI) occurs when a charge carrier gains sufficient kinetic energy to overcome the potential barrier necessary to break through the gate oxide. Those carriers are injected into the oxide and eventually cause a shift in the device's performance. The device degradation effects of HCI are more important at low temperature. There are two available models, one for nMOS and one for pMOS devices [7]. Those models make the practical assumption that the variables are independent of each other. The activation energy for this phenomenon is negative; as such, the lifetime degradation decreases as temperature rises.

1) n-channel model

$$\mathrm{MTTF}_{HCI,n} = B \cdot (I_{sub})^{-N} \cdot e^{\frac{E_a}{kT}} \qquad (5)$$

2) p-channel model

$$\mathrm{MTTF}_{HCI,p} = B \cdot (I_{gate})^{-M} \cdot e^{\frac{E_a}{kT}} \qquad (6)$$

where $I_{sub}$ and $I_{gate}$ are the peak substrate and gate currents during stressing, respectively, and $B$, $N$ and $M$ are model parameters [7].

Thermal Cycling (TC), that is, repeatedly increasing and decreasing temperature, causes molecular reorganization of materials. TC effects are present in CMOS devices as well, due to the repeated powering on and off of the transistors. Temperature cycles accumulate damage in the materials of the circuit and eventually cause its destruction [2][7]. In most cases the effects of TC are expressed by a modified Coffin-Manson equation [7]:

$$\mathrm{MTTF}_{TC} = C_0 \cdot (\Delta T)^{-q} \qquad (7)$$

where $\Delta T$ is the temperature range of a cycle, $C_0$ is a material-dependent constant and $q$ is the Coffin-Manson exponent.
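As a small illustration of how strongly the cycle amplitude affects a Coffin-Manson lifetime, the following sketch computes the lifetime ratio between two temperature swings using the basic power-law form, lifetime proportional to (ΔT)^(-q). The default exponent `q` is a placeholder, and the function name is hypothetical.

```python
def tc_lifetime_ratio(delta_t_ref, delta_t_new, q=2.35):
    """Return lifetime(new) / lifetime(ref) for a Coffin-Manson power law.

    delta_t_* are temperature swings per cycle in kelvin; q is an assumed,
    material-dependent exponent. The constant C0 cancels in the ratio.
    """
    if delta_t_ref <= 0 or delta_t_new <= 0:
        raise ValueError("temperature swings must be positive")
    return (delta_t_new / delta_t_ref) ** (-q)


# Halving the temperature swing from 40 K to 20 K extends the lifetime.
print(tc_lifetime_ratio(delta_t_ref=40.0, delta_t_new=20.0))
```

Because the dependence is a power law in the swing rather than an exponential in the absolute temperature, reducing the amplitude of temperature cycles can matter as much as lowering the average temperature.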

In addition to the aforementioned models, two more factors impact reliability: Negative Bias Temperature Instability (NBTI) and ionic corrosion. Both have been modeled as well; these models, however, are left to be included in the complete version of the proposed framework.

We use (1) to unify the models and obtain an integrated model that we then use in conjunction with a proposed leakage power model and its impact on reliability in order to explore the best operating conditions and the combination of design parameters. We explain the leakage power model next.


Figure 1. Proposed Framework Outline (a) and Workflow (b)

IV. SYSTEM-LEVEL CMOS LEAKAGE MODEL

The power consumption of CMOS devices can be calculated as the sum of dynamic and static power [9]. In the proposed framework, power plays a dual role: it is given as a system constraint, and it is part of the output, since leakage power impacts temperature, which in turn impacts lifetime. Moreover, as noted earlier, leakage power and temperature are interdependent. While the model describing dynamic power is easy to use, existing leakage power models are more complex because leakage stems from several sources in the physical structure of a transistor; hence designers are willing to settle for simpler models [9]. The leakage current can be expressed quite accurately by considering only the subthreshold current, Isub, which is a strong function of temperature and can lead to thermal runaway [9].

MOSFET SPICE models that take subthreshold leakage current into account, such as BSIM [10], have been developed over the years. We adopt a simplified version of the BSIM model, constructed using an approach similar to the one proposed in [11]. This model addresses the need of architects for an early design-time modeling tool: it estimates the subthreshold leakage current with sufficient accuracy by applying certain circuit-level assumptions (same-sized devices, roughly equal numbers of PMOS and NMOS devices):

(8)

where K1, b and n are model parameters, W is the gate width and Vdd0 is the default supply voltage for the selected technology. Vdd and Vth are the supply and threshold voltage, respectively. The results from this simple model have been validated using the results stemming from [13].
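The interdependence between leakage power and temperature described in this section can be sketched as a fixed-point iteration. The sketch below does not implement the leakage model of (8): it assumes a lumped thermal resistance `r_thermal` and a hypothetical exponential leakage law `leak_ref * exp(beta * (T - t_ref))`, and all parameter values are illustrative.

```python
import math


def steady_state_temperature(p_dynamic, t_ambient, r_thermal, leak_ref, t_ref,
                             beta, tol=1e-6, max_iter=1000, t_limit=500.0):
    """Iterate T = t_ambient + r_thermal * (p_dynamic + p_leak(T)).

    Returns the converged temperature in kelvin, or None if the loop
    diverges past t_limit (thermal runaway) or fails to converge.
    p_leak(T) is an assumed exponential stand-in, not the paper's model.
    """
    t = t_ambient
    for _ in range(max_iter):
        p_leak = leak_ref * math.exp(beta * (t - t_ref))
        t_next = t_ambient + r_thermal * (p_dynamic + p_leak)
        if t_next > t_limit:
            return None  # leakage and temperature feed each other without bound
        if abs(t_next - t) < tol:
            return t_next
        t = t_next
    return None


# Mild coupling: converges slightly above the leakage-free value of 310 K.
print(steady_state_temperature(20.0, 300.0, 0.5, 2.0, 300.0, 0.02))
# Strong coupling: the iteration diverges, which models thermal runaway.
print(steady_state_temperature(20.0, 300.0, 2.0, 5.0, 300.0, 0.05))
```

A loop of this kind is one way an optimizer could evaluate a candidate operating point: the converged temperature feeds the lifetime models, and a diverging iteration marks the candidate as infeasible.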

V. OPTIMIZATION FRAMEWORK & RESULTS

The optimization framework introduced in this work finds pareto-optimal operating conditions for an IC design, so as to keep the design within the device's functional boundaries. The framework receives as input the technology parameters of the IC (including the estimated number of transistors), which are used to evaluate the models of Sections III and IV, and the expected operational constraints (operating frequency, anticipated power consumption, temperature range and expected lifetime reliability). It returns the pareto-optimal operating conditions and the necessary operational parameters (Vth, Vdd and frequency f). Given the problem's conflicting objectives (power, performance, reliability), a solution that is optimal for one objective (e.g. reliability) may not be as good for energy and performance. The framework therefore returns the set of pareto-optimal solutions rather than a single optimum. In the absence of further information, the choice of which solution to use is left to the designer.

As a starting optimization solver, we selected the NSGA-II evolutionary algorithm [14]. A multi-objective evolutionary algorithm (MOEA) can produce multiple pareto-optimal solutions in a single simulation run. A classical optimization method is at a disadvantage in this case because of the increased simulation time: classical methods usually convert the multi-objective problem to a single-objective one and focus on one particular pareto-optimal solution per execution. To obtain multiple solutions, those methods have to be applied multiple times, in the hope of finding a different solution at each run. Other algorithms that produce multiple pareto-optimal solutions in a single run, such as multi-objective simulated annealing, are under evaluation; conclusions about them are left for future work.
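To clarify what "pareto-optimal" means for the solution set, the sketch below filters a list of candidate operating points down to its non-dominated subset. This is only the dominance test that underlies algorithms such as NSGA-II, not NSGA-II itself, and the candidate values are hypothetical. All objectives are minimized, so lifetime and frequency are negated.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (power in W, -lifetime in years, -frequency in MHz).
candidates = [
    (1.0, -12.0, -400.0),
    (1.5, -10.0, -530.0),
    (1.6, -9.0, -500.0),   # worse than the previous point in every objective
    (2.5, -6.0, -700.0),
]
for point in pareto_front(candidates):
    print(point)
```

Each remaining point represents a different trade-off, so none can be discarded without additional information from the designer.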

The framework was implemented in MATLAB and is based on the mathematical formulae used to describe the multi-objective problem defined by the models given in Table I. The lifetime estimation is computed using the models described in Section III and the leakage model in Section IV. An outline of the framework is given in Fig.1.

TABLE I. PROPOSED FRAMEWORK’S MATHEMATICAL FORMS

TABLE II. MULTI-OBJECTIVE PROBLEM TEST CASE

Figure 2. Use of framework results in task schedulers

The proposed framework can be used as an early design cycle decision-making tool for circuits. To illustrate its usability, an example test circuit defined by the parameters shown in Table II, extracted from [14-16], is considered. The resulting pareto-optimal solution set is shown in Fig.3, expressed as a set of 3D plots for readability. Results are shown for the operating conditions defining leakage and active power, and for the expected lifetime reliability under the supplied constraints and technology parameters. To understand the usability of the solution set, let us assume that the test circuit will be integrated in a handheld device (i.e. low temperature). For the given device specifications, the desirable operating parameters from the pareto-optimal set for highest performance would be f = 530 MHz and Vth = 0.1335 V, with an operating temperature of 311 K (~38 °C). The solution is circled on each of the three plots of Fig.3. A higher frequency would not be recommended, as it would increase the operating temperature and violate the specifications. The proposed framework can therefore provide designers with early feedback, reducing the design time.
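The selection step in the handheld example can be sketched as a simple filter over the pareto-optimal set: keep the solutions that respect the temperature limit and take the fastest one. In the sketch below only the first entry (530 MHz, Vth = 0.1335 V, 311 K) comes from the text; the other entries, the 315 K limit and the function name are hypothetical.

```python
def pick_operating_point(pareto_set, t_max_kelvin):
    """Return the highest-frequency solution whose temperature does not
    exceed t_max_kelvin, or None if no solution satisfies the limit."""
    feasible = [s for s in pareto_set if s["temp_k"] <= t_max_kelvin]
    if not feasible:
        return None
    return max(feasible, key=lambda s: s["freq_mhz"])


pareto_set = [
    {"freq_mhz": 530.0, "vth_v": 0.1335, "temp_k": 311.0},  # reported in the text
    {"freq_mhz": 450.0, "vth_v": 0.1500, "temp_k": 305.0},  # hypothetical
    {"freq_mhz": 700.0, "vth_v": 0.1200, "temp_k": 330.0},  # hypothetical
]

# The 315 K limit is an assumed specification for a handheld device.
best = pick_operating_point(pareto_set, t_max_kelvin=315.0)
print(best)
```

Other designer preferences, such as minimum lifetime or maximum power, could be added as further filters before the final choice.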

The proposed framework can alternatively be used in other design exploration stages, such as optimizing task scheduling on a processor. Given the framework’s extracted pareto-optimal set of operational conditions that include the desirable temperature Topt, thermal-aware task schedulers can select the schedule which yields an average temperature closer to Topt. We demonstrate this in Fig.2, by showing the thermal behavior of two task schedules on the Alpha 21364 processor extracted from [4,15] compared to the pareto-optimal solution set extracted from Alpha’s operating conditions. It is evident that Schedule2 is better than Schedule1, operating close to the pareto-optimal conditions.
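A thermal-aware scheduler could apply the criterion described above in the following way, assuming each candidate schedule comes with a simulated temperature trace. The traces and the value of `t_opt` below are hypothetical and are not read from Fig.2.

```python
from statistics import mean


def closest_to_optimum(schedules, t_opt):
    """Return the name of the schedule whose average temperature is
    nearest to t_opt. schedules maps a name to a list of temperatures (K)."""
    return min(schedules, key=lambda name: abs(mean(schedules[name]) - t_opt))


# Hypothetical temperature traces in kelvin for two candidate schedules.
schedules = {
    "Schedule1": [335.0, 342.0, 350.0, 347.0],
    "Schedule2": [312.0, 318.0, 315.0, 320.0],
}
print(closest_to_optimum(schedules, t_opt=311.0))
```

The average is the simplest possible summary of a trace; a scheduler that also cares about thermal cycling might penalize the spread of the trace as well.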

VI. CONCLUSION & FUTURE WORK

This paper proposes the use of a multi-objective optimization framework that can help system designers select the appropriate design parameters and operating conditions for their design, by taking into account the expected lifetime, power, performance and technology constraints of the design. The framework integrates existing failure mechanism models and produces a pareto-optimal solution set that can guide the designer in early design stages. Ongoing and future work includes expanding the reliability estimation algorithm to include wear-out phase calculations, as well as integrating the NBTI and corrosion failure mechanisms. We are also exploring the use of other multi-objective optimization solvers, such as simulated annealing, in an effort to optimize the performance and generality of the framework.

REFERENCES

[1] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, "The case for lifetime reliability-aware microprocessors," ACM SIGARCH Computer Architecture News, vol. 32, 2004.

[2] E. Karl, D. Blaauw, D. Sylvester, and T. Mudge, "Reliability modeling and management in dynamic microprocessor-based systems," 43rd ACM/IEEE Design Automation Conf., 2006, pp. 1057-1060.

[3] K. Waldschmidt, J. Haase, A. Hofmann, M. Damm, and D. Hauser, "Reliability-aware power management of multi-core Systems (MPSoCs)," Dynamically Reconfigurable Architectures, Springer, 2006.

[4] L. Huang, F. Yuan, and Q. Xu, "Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms," DATE, 2009.

[5] S. Lin et al., "Impact of off-state leakage current on electromigration design rules for nanometer scale CMOS technologies," IEEE Int. Reliability Physics Symposium; 1999, 2004, p. 74–78.

[6] Y. Shiyanovskii, F. Wolff, C. Papachristou, D. Weyer, and W. Clay, "Exploiting Semiconductor Properties for Hardware Trojans," ACM CoRR, 2009.

[7] JEP122-B, Failure Mechanisms and Models for Semiconductor Devices, JEDEC Publication, 2003.

[8] A. Dasgupta and R. Karri, "Electromigration reliability enhancement via bus activity distribution," 33rd Design Automation Conf., 1996.

[9] N. Kim et al., "Leakage Current: Moore's Law meets static power," IEEE Computer Society - Computer, vol. 36, 2003.

[10] BSIM Research Group, http://www-device.eecs.berkeley.edu/~bsim3/.

[11] J. Butts and G. Sohi, "A static power model for architects," 33rd IEEE/ACM International Symposium on Microarchitecture, 2000.

[12] Springer Handbook of Engineering Statistics, Springer, 2006.

[13] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects," Science, 2003.

[14] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. on Evolutionary Computation, vol. 6, 2002, pp. 182-197.

[15] K. Skadron et al., "Temperature-aware microarchitecture," ISCA, 2003.

[16] Int. Technology Roadmap for Semiconductors, http://www.itrs.net.

Figure 3. Pareto Optimal Solutions per Objective
