

Advances in industrial practices for optimal performance/reliability/power trade-off in commercial high-performance microprocessors for wireless applications

V. Huard, F. Cacho, L. Claramond, P. Alves, W. Dalkowski, D. Jacquet
STMicroelectronics, 850 rue Jean Monnet, 38926 Crolles, FRANCE
Corresponding author: [email protected]

S. Lecomte, M. Tan, B. Delemer, A. Kamoun, V. Fraisse
ST-Ericsson, 12 rue Jules Horowitz, 38019 Grenoble, FRANCE
Corresponding author: [email protected]

Abstract— This paper deals with the challenge of optimizing the performance/reliability/power trade-off in commercial high-performance microprocessors for wireless applications in advanced CMOS nodes. Both the increased impact of electrical reliability degradation and an increased thermal runaway risk require a dedicated approach, combining product engineering and high-level modeling, to achieve optimal reliability guardband determination even in the case of numerous, discrete V-F operating modes and their related mission profiles.

Keywords- reliability, microprocessor, wireless application, guardband, product qualification, gate-level models, aged models, DVFS, AVS, failure rate

I. INTRODUCTION
Reliability issues are becoming increasingly difficult to optimize at advanced nodes at the process level alone, due to diverse device offerings, complex products and overdrive requirements. Design-in Reliability (DiR) methodologies provide a quantitative assessment of reliability – CMOS device reliability in this case – at the design stage, thereby enabling judicious margins to be taken beforehand. A good discussion of the DiR approach can be found in [1]; it has typically addressed degradation due to Hot-Carrier Injection (HCI). Of late, Negative Bias Temperature Instability (NBTI) degradation of p-channel transistors has emerged as a prominent degradation mechanism and its impact on circuits is increasingly being discussed [2-4]. Finally, the introduction of HiK dielectrics and metal gates was accompanied by Positive BTI in more recent nodes. To account for this operation-related performance loss, products have been designed using a guard-banding methodology [5]. Furthermore, pessimistic guard bands are also used for high-speed product classification to guarantee reliable functionality over the entire product lifetime [6-10].

In this paper, we describe, for the first time, a new methodology to handle reliability guard banding in commercial high-performance microprocessors for wireless applications. This approach aims to remove pessimism by combining silicon measurements and high-level modeling to describe the whole Safe Operating Area in which the product can reliably operate. It works not only for single-operating-mode microprocessors but also for the more complex case where several discrete Voltage-Frequency operating modes are used for power-saving purposes. The new methodology allows the reliability margin needed to cover the various parameters of product usage to be evaluated accurately, as well as the competition between reliability and thermal runaway risk.

II. NEED FOR ADVANCES IN PRODUCT QUALIFICATION
With the ever-increasing demand for high performance from handheld devices comes the realization that high performance is not without cost. The cost of high performance is power consumption. At the same time, consumers are demanding both high performance and long battery life. Traditionally, these goals have been at odds and, generally speaking, a choice had to be made between performance and battery life. Today's embedded processors in wireless applications are equipped with power-saving features that try to save power while at the same time delivering high performance.

Current reliability qualification and product guard-banding approaches are based on worst-case temperature and utilization; however, delivering the highest computational performance with the greatest energy efficiency, while maintaining acceptable yield and reliability, is today's challenge for microprocessors. Both CPU performance and power limits are now constrained by reliability mechanisms. Though many efforts have already been deployed to generate built-in reliable designs, once in production both the manufacturing and test flows allow additional fine-grained tuning to reach an optimized performance/reliability/power trade-off. The parameter space for this "what-if"-style analysis is very large.

One method that is increasingly being used to reduce power consumption while maintaining high performance is Dynamic Voltage and Frequency Scaling, hereafter referred to as "DVFS". The key idea behind DVFS is to scale the voltage and frequency of the processor to provide "just-enough" circuit speed to process the system workload while meeting the total compute time and/or throughput constraints, thereby reducing the energy dissipation. Since the energy dissipated per cycle in CMOS circuitry scales quadratically with the supply voltage, DVFS can potentially provide significant energy savings.
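As an illustration of this quadratic dependence, the short sketch below estimates the relative energy per cycle and dynamic power of a low-power operating mode against a high-performance one. The voltage/frequency pairs and the constant-capacitance assumption are illustrative, not values from the measured devices.

```python
# Illustrative estimate of DVFS energy/power scaling (hypothetical numbers).
# Dynamic energy per cycle scales as E ~ C * Vdd^2, so dynamic power at a
# given frequency scales as P ~ C * Vdd^2 * f (capacitance assumed constant).

def relative_energy_per_cycle(v, v_ref):
    """Energy per cycle at supply v, normalized to the reference supply."""
    return (v / v_ref) ** 2

def relative_power(v, f, v_ref, f_ref):
    """Dynamic power at (v, f), normalized to the reference (v_ref, f_ref)."""
    return (v / v_ref) ** 2 * (f / f_ref)

# Hypothetical operating modes: (Vdd in volts, frequency in GHz).
high_perf = (1.2, 1.0)
low_power = (0.9, 0.5)

e_ratio = relative_energy_per_cycle(low_power[0], high_perf[0])
p_ratio = relative_power(*low_power, *high_perf)
print(f"Energy/cycle in low-power mode: {e_ratio:.2f}x of high-performance mode")
print(f"Dynamic power in low-power mode: {p_ratio:.2f}x of high-performance mode")
```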

As a consequence of DVFS use, embedded microprocessors operate at various discrete Voltage-Frequency operating modes depending on the workload, which also results in different junction temperatures. On top of that, the use of the Adaptive Voltage Scaling (AVS) methodology allows every single chip to get its own voltage value for all IP operating modes. Overall, it is impossible to realistically address all these operating modes during the product qualification exercise, where only a few of them at best will be experienced on silicon. Nevertheless, all these situations will be met in the field and need to be guaranteed as well, which is only feasible through an adequate modeling approach. After introducing the basics of our modeling framework, the rest of the paper deals first with product qualification strategies and then with the impact of various context changes (mission profile, maximum temperature…) on the performance/reliability/power trade-off. Dual-core application processors, designed in both the 45nm SiON node and the 32nm Hi-K node, were measured and stressed on silicon, serving as test cases for this study.

III. FAILURE DEFINITION IN DIGITAL CIRCUITS
In synchronous digital circuits (either globally or locally synchronized), the frequency is forced through a controlled oscillator (such as a PLL) at the frequency required for the operating mode, generally to achieve a targeted throughput. A minimum voltage (Vmin) can be defined as the lowest voltage at which the corresponding pattern works properly. For a fixed frequency, this results in a minimum-voltage yield curve for a population of microprocessors (cf. figure 1). As expected, to sustain an increased frequency, the minimum voltage also increases because the transistors must speed up to follow the PLL.
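The Vmin yield curve is essentially the cumulative distribution of per-die minimum voltages at a fixed PLL frequency. The sketch below tabulates such a curve from a hypothetical population of per-die Vmin measurements; the distribution parameters and sample size are illustrative assumptions only.

```python
# Tabulate a Vmin yield curve (yield vs. supply voltage) at a fixed PLL
# frequency from per-die Vmin measurements. All numbers are illustrative.
import random

random.seed(0)
# Hypothetical population: per-die Vmin (V) at the chosen frequency.
vmin_samples = [random.gauss(0.95, 0.02) for _ in range(1000)]

def yield_at(v_supply, vmin_population):
    """Fraction of dies functional at v_supply, i.e. whose Vmin <= v_supply."""
    passing = sum(1 for vmin in vmin_population if vmin <= v_supply)
    return passing / len(vmin_population)

# Sweep the supply voltage to obtain a yield curve like the one in figure 1.
for v in (0.90, 0.93, 0.95, 0.97, 1.00):
    print(f"V = {v:.2f} V -> yield = {100 * yield_at(v, vmin_samples):.1f}%")
```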

On the other hand, even in the presence of a high-accuracy voltage regulator on the application board, mechanisms such as static and dynamic voltage drops or HF ripple generate supply voltage fluctuations inside the microprocessor, which translate into a range of supply voltages over which the microprocessor functionality must be guaranteed (cf. figure 1).

[Figure 1 plot: supply voltage (y-axis) vs. application processor yield in % (x-axis); Vmin curves for Frequency 1, 2 and 3; Vnom, supply accuracy, static/dynamic drop, HF ripple and the Vmin/Vmax voltage range annotated.]

Figure 1: Application processor Vmin yield curves for three different PLL frequencies for speed indicative functional pattern in 45nm node. Voltage range to be guaranteed is mentioned using minimum and maximum bounds. It is worth noticing that voltage range might depend on the targeted frequency.

For a given PLL frequency, the minimum voltage yield curve shifts upwards after at-speed HTOL tests using functional patterns, as a direct consequence of the transistor slow-down induced by electrical reliability degradation modes. After some stress time, some yield loss might be expected at the minimum bound of the guaranteed voltage range. In real field operation, this configuration translates into a potential pattern failure with a loss of user experience. This configuration of functionality loss will hereafter be referred to in this paper as the failure rate (cf. figure 2).
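With this definition, the failure rate at a given frequency is the aged yield loss evaluated at the minimum bound of the guaranteed voltage range. A minimal sketch follows, using a hypothetical fresh Vmin population and an assumed uniform ageing-induced Vmin shift; both are illustrative, not measured values.

```python
# Failure rate as the aged yield loss at the minimum guaranteed voltage.
# The fresh Vmin population and the 30 mV ageing shift are illustrative only.
import random

random.seed(1)
fresh_vmin = [random.gauss(0.95, 0.02) for _ in range(1000)]
AGEING_SHIFT = 0.030                       # assumed Vmin drift after HTOL (V)
aged_vmin = [v + AGEING_SHIFT for v in fresh_vmin]

def yield_at(v_supply, vmin_population):
    return sum(1 for v in vmin_population if v <= v_supply) / len(vmin_population)

V_MIN_GUARANTEED = 1.00                    # lower bound of the voltage range (V)
fresh_yield = yield_at(V_MIN_GUARANTEED, fresh_vmin)
aged_yield = yield_at(V_MIN_GUARANTEED, aged_vmin)
failure_rate = fresh_yield - aged_yield    # parts working when fresh, failing when aged

print(f"Fresh yield at the Vmin bound: {100 * fresh_yield:.2f}%")
print(f"Aged yield at the Vmin bound:  {100 * aged_yield:.2f}%")
print(f"Failure rate (yield loss):     {100 * failure_rate:.2f}%")
```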

[Figure 2 plot: supply voltage vs. application processor yield in %; fresh and aged Vmin curves at Frequency 2 for increasing HTOL time; failure rate, Vnom, supply accuracy, static/dynamic drop, HF ripple and the Vmin/Vmax voltage range annotated.]

Figure 2: Application processor Vmin yield curves for fresh (no ageing) and two increasing stress times. Aged yield curves were measured following at-speed HTOL tests using functional patterns. A failure in digital circuits is defined as the yield loss at the minimum voltage that should be guaranteed.

IV. MODELING FRAMEWORK
The failure rate calculation thus requires knowledge of the Vmin yield curve before and after any kind of electrical stress, to allow comparison with the targeted voltage range. In our modeling framework, every digital circuit is considered as a collection of independent speed-limiting critical paths. In state-of-the-art designs, these critical paths are configured to have a small timing margin (slack) with respect to the timing target, resulting in a tight, logistic-like distribution [11]. The critical path distribution results from the design implementation (slack distribution) and from gate-level modeling of die-to-die and within-die variations. The modification of the critical path distribution by electrical reliability degradation modes such as BTI and HCI is introduced according to the aged gate-level modeling approach described in [5]. The oxide breakdown (TDDB) mode is also considered, through an oxide area evaluation based on an automated analysis of the product design database. Based on these critical path distributions and sensitivity analyses, and integrating high-level information on the design in conjunction with the mission profile, a hierarchical model of the Vmin yield curves can be built before and after any kind of mission profile (cf. figure 3).
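The core of this framework can be pictured as a Monte-Carlo loop over dies: each die draws a die-to-die offset, each critical path adds a within-die term and an ageing-dependent slowdown, and the die Vmin is the lowest voltage at which the slowest path still meets the clock period. The sketch below is a strongly simplified illustration of that idea; the linear delay-voltage law, the variability spreads and the ageing coefficient are assumptions, not the calibrated gate-level models of [5].

```python
# Strongly simplified Monte-Carlo illustration of the Vmin modeling idea:
# a die is a collection of critical paths; its Vmin is the lowest voltage at
# which the slowest path still meets the clock period. The delay law,
# variability spreads and ageing slowdown are assumptions, not fitted models.
import random

random.seed(2)
N_DIES, N_PATHS = 200, 100
T_CLK = 1.0                                  # clock period (normalized)

def path_delay(v, d2d, wid, ageing):
    """Assumed delay law: delay rises as voltage drops, multiplied by
    die-to-die (d2d), within-die (wid) and ageing contributions."""
    nominal = 0.85 + 1.0 * (1.0 - v)
    return nominal * (1.0 + d2d + wid + ageing)

def die_vmin(d2d, wid_terms, ageing=0.0):
    """Lowest voltage of a downward sweep at which every path meets T_CLK."""
    vmin = None
    for step in range(80):                   # sweep 1.10 V down to 0.705 V
        v = 1.10 - 0.005 * step
        if all(path_delay(v, d2d, w, ageing) <= T_CLK for w in wid_terms):
            vmin = v
        else:
            break
    return vmin

dies = [(random.gauss(0.0, 0.03),
         [abs(random.gauss(0.0, 0.01)) for _ in range(N_PATHS)])
        for _ in range(N_DIES)]

fresh = [die_vmin(d2d, wid) for d2d, wid in dies]
aged = [die_vmin(d2d, wid, ageing=0.02) for d2d, wid in dies]   # assumed 2% slowdown

def yield_at(v, vmins):
    return sum(1 for m in vmins if m is not None and m <= v) / len(vmins)

print(f"Fresh yield at 0.92 V: {100 * yield_at(0.92, fresh):.1f}%")
print(f"Aged yield at 0.92 V:  {100 * yield_at(0.92, aged):.1f}%")
```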

[Figure 3 diagram: TECHNOLOGY inputs (D2D & WID variations, reliability models, SPICE models) and DESIGN inputs (digital design, critical path collection) feed gate-level modeling of D2D/WID/ageing path-delay impact, followed by hierarchical modeling (core type, number of cores, number of CPs per core, sign-off corners, mission profile, voltage stacks) producing the voltage-frequency Safe Operating Area.]

Figure 3: Modeling framework flow for failure rate SOA determination



From this framework, both fresh (no ageing) and aged (following at-speed HTOL stress or in-field predictions) yield curves can be modeled according to the approach in [5].

Figure 4 (a) shows that the modeled Vmin yield curves compare well with the silicon-measured ones on the dual-core application processor processed in the 45nm SiON node. The PLL frequency is fixed in this case at 1GHz and a performance-indicative functional pattern is run. From the knowledge of the Vmin yield curves (both fresh and aged) and their comparison with the targeted voltage stacks, it is possible to model first the yield curve for varying nominal voltage and PLL frequency (cf. figure 4 (b)). In this graph, the yielding conditions are described by the blue hues.

By comparing fresh (no ageing) and aged (following at-speed HTOL stress) yield curves, it is possible to extract the failure rate, defined as the yield loss for every single operating point (voltage and frequency).

Extending this analysis to several voltage/frequency operating modes (used as at-speed HTOL stress conditions here) allows the failure rate Safe Operating Area (SOA) to be drawn for a given application processor (cf. figure 4 (c)), following a Design-of-Experiments approach for missing points. In this graph, every color change from blue to red refers to one decade of failure rate increase, and the area described by the blue hues is the Safe Operating Area for the application processor design under consideration. The failure rate SOA obtained through the proposed modeling framework (cf. figure 4 (d)) shows excellent agreement with the experimental results.
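In code, building such an SOA amounts to evaluating the failure rate over a voltage-frequency grid and flagging the operating points that stay below the product target. The failure-rate function in the sketch below is a hypothetical placeholder standing in for the modeled fresh/aged yield-loss calculation; the grid, target and coefficients are assumptions.

```python
# Sketch of building a failure-rate Safe Operating Area over a V-F grid.
# The failure_rate() function is a hypothetical placeholder standing in for
# the modeled fresh-versus-aged yield-loss calculation.
import math

def failure_rate(v, f_ghz):
    """Hypothetical failure rate: grows when frequency rises (less timing
    slack) and when voltage rises (more BTI/HCI stress). Illustration only."""
    timing_limited = math.exp(12.0 * (f_ghz - 1.0) - 10.0 * (v - 1.0))
    ageing_limited = math.exp(12.0 * (v - 1.25))
    return 1e-6 * (timing_limited + ageing_limited)

FR_TARGET = 1e-4                                   # product target (illustrative)
voltages = [0.90 + 0.05 * i for i in range(7)]     # 0.90 .. 1.20 V
frequencies = [0.8 + 0.1 * j for j in range(8)]    # 0.8 .. 1.5 GHz

print("V (V) \\ F (GHz): " + "  ".join(f"{f:.1f}" for f in frequencies))
for v in voltages:
    row = ["SAFE" if failure_rate(v, f) < FR_TARGET else " -- " for f in frequencies]
    print(f"{v:.2f}             " + "  ".join(row))
```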

Similar results have also been obtained on the dual-core application processor processed in the 32nm Hi-K node. Both the median Vmin values and the median total current consumption have been modeled for the fresh situation (no ageing) (lines in figure 5, top and middle respectively) for various operating conditions including voltage, PLL frequency and body biasing conditions. Figure 5 (bottom) shows that the modeled Vmin yield curves compare well with the silicon-measured ones on the dual-core application processor processed in the 32nm Hi-K node.

[Figure 4 plots: (a) % products working at 1GHz vs. nominal voltage (a.u.), FRESH and AGED curves; (b)-(d) nominal voltage (a.u.) vs. frequency (0.7-1.5 GHz) maps.]

Figure 4: a: Experimental Vmin yield curves (symbols) for both fresh (black) and stressed (red) configuration for the 45nm dual-core application processor running a performance indicative sequence functional pattern at 1GHz. Yield curves resulting from modeling are shown as lines. Failure rate is defined as the yield loss at any voltage for a frequency. b: Dual-core application processor in 45nm SiON node modeled Vmin yield curves for fresh (no ageing) situation going from zero (red) to 100% yield (blue). c: Failure rate SOA built with Voltage-Frequency operating modes used as HTOL stress conditions. Every color change indicates one decade failure rate increase. d: Failure rate SOA from modelling framework showing good agreement with experiments.

Overall, we have developed and validated, on various conditions and technologies, a modeling framework which allows accurate evaluation of digital design performance and failure rates. This framework opens the way for a modeling-assisted product qualification procedure.



[Figure 5 plots: (top) nominal voltage (a.u.) vs. PLL frequency (0.5-2.5 GHz), with and without forward body bias; (middle) total current (a.u.) vs. PLL frequency, with and without forward body bias; (bottom) % products working at 1.85GHz vs. nominal voltage (a.u.), FRESH and AGED curves.]

Figure 5: Top: Dual-core application processor in 32nm Hi-K node modeled median Vmin values for fresh situation (lines) for various operating points including voltage, frequency and body biasing conditions. Good agreement is observed with respect to experimental measurements (symbols) based on power-indicative functional pattern. Middle: Dual-core application processor in 32nm Hi-K node modeled median total current consumption values for fresh situation (lines) for various operating points including voltage, frequency and body biasing conditions. Good agreement is observed with respect to experimental measurements (symbols) based on power-indicative functional pattern, nominal voltage being fixed in this measurement. Bottom: Experimental (symbols) and modeled (lines) yield curves for performance indicative sequence for both fresh (black) and stressed (red) configuration for the 32nm dual-core application processor running at 1.85GHz.

V. AGEING MARGIN DEFINITION AND DETERMINATION
In order to assist product qualification and to define an adequate ageing margin, it is important to understand that the voltage stack and the Vmin yield curve play a dominant role in the failure rate calculation. The classical way to reduce the failure rate is to apply an ageing margin in the bottom part of the voltage stack (i.e. the board-level voltage remains constant while the minimum transistor-level voltage is swept downwards). This ageing margin allows binning for good parts in the fresh distribution (figure 6, top and middle), which in turn reduces the overall yield. Nevertheless, it also turns out to reduce the failure rate (figure 6, bottom).
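The mechanism can be illustrated with a toy population: parts are screened at test against the guaranteed minimum voltage minus the ageing margin, which costs test yield but leaves fewer shipped parts close enough to the bound to fail after ageing. The Vmin distribution, ageing shift and voltage bound below are illustrative assumptions.

```python
# Effect of an ageing margin applied at the bottom of the voltage stack:
# parts are binned at test against (V_BOUND - margin), which costs yield but
# reduces the failure rate after ageing. All numbers are illustrative.
import random

random.seed(3)
fresh_vmin = [random.gauss(0.95, 0.02) for _ in range(20000)]
AGEING_SHIFT = 0.030            # assumed Vmin drift over the mission profile (V)
V_BOUND = 1.00                  # minimum guaranteed supply voltage (V)

def ship_and_fail(margin):
    """Return (test yield, in-field failure rate) for a given ageing margin."""
    shipped = [v for v in fresh_vmin if v <= V_BOUND - margin]     # test binning
    test_yield = len(shipped) / len(fresh_vmin)
    failing_aged = sum(1 for v in shipped if v + AGEING_SHIFT > V_BOUND)
    failure_rate = failing_aged / len(shipped) if shipped else 0.0
    return test_yield, failure_rate

for margin in (0.000, 0.010, 0.020, 0.030):
    y, fr = ship_and_fail(margin)
    print(f"margin = {1000 * margin:4.0f} mV  "
          f"test yield = {100 * y:5.1f}%  failure rate = {100 * fr:6.3f}%")
```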

[Figure 6 plots: three panels of % products working at 1GHz vs. application processor voltage (a.u.); (top) fresh and aged curves with failure rate, Vmin and Vnom annotated; (middle) fresh curves with and without binning and the ageing margin annotated; (bottom) fresh and aged curves with and without binning, showing the reduced failure rate.]

Figure 6: Adding an ageing margin in the voltage stack (top) allows binning good parts (middle), which results in a tighter initial distribution. For a similar drift, the tighter distribution yields a reduced failure rate (bottom).



This trade-off between yield loss and improved failure rate is exemplified in figure 7, where an aggressive voltage range was intentionally chosen to generate failures across the overall tested sample. Achieving a low failure rate under these assumptions might sometimes require running volume production at low yield, which is not cost effective.

[Figure 7 plot: processor yield and failure rate vs. ageing margin (a.u.) at constant nominal voltage; increasing ageing margin indicated.]

Figure 7: Dual-core application processor in 45nm SiON node measured and modeled yield and failure rate evolution as a function of ageing margin and for a constant nominal voltage.

A second approach to introducing an ageing margin is to keep the minimum voltage Vmin constant, and thus be able to guarantee a targeted yield. In this case, the nominal voltage is raised by the amount of the ageing margin. As a consequence, the failure rate is lowered by an equivalent amount, at the cost of an additional power increase (cf. figure 8). Overall, the power increase is limited, since it depends linearly on the ageing margin for a given fixed PLL frequency.
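Since dynamic power scales as V²·f, a small increase of the nominal voltage at fixed frequency costs roughly 2·ΔV/V in relative power, i.e. the penalty grows almost linearly with the ageing margin. The short sketch below compares the exact quadratic scaling with this first-order estimate; the nominal voltage is a hypothetical value.

```python
# First-order power cost of raising the nominal voltage by the ageing margin
# at a fixed PLL frequency (dynamic power ~ V^2 * f). V_NOM is hypothetical.
V_NOM = 1.10                    # nominal supply voltage (V)

for margin_mv in (0, 10, 20, 30, 40):
    v = V_NOM + margin_mv / 1000.0
    exact = (v / V_NOM) ** 2 - 1.0                  # quadratic scaling
    linear = 2.0 * (margin_mv / 1000.0) / V_NOM     # first-order estimate
    print(f"margin = {margin_mv:2d} mV  power increase: "
          f"{100 * exact:4.2f}% exact, {100 * linear:4.2f}% linear")
```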

[Figure 8 plot: processor power increase and failure rate vs. ageing margin (a.u.) with varying nominal voltage; increasing ageing margin indicated.]

Figure 8: Dual-core application processor in 45nm SiON node measured and modeled power increase and failure rate evolution as a function of ageing margin and for an increasing nominal voltage.

More interestingly, our modeling approach allows process centering variations, which might happen from one fab to another, to be taken into account. As a result, for similar conditions (voltage, mission profile…), a faster process corner yields a lower failure rate (cf. figure 9). It was also found that the rate of failure-rate reduction with increasing ageing margin remains unchanged across process corners.

[Figure 9 plot: failure rate (a.u.) vs. ageing margin (a.u.); faster process direction and the ~1 failure-rate decade per 10mV slope indicated.]

Figure 9: Dual-core application processor in 32nm Hi-K node failure rate evolution as a function of ageing margin and for an increasing nominal voltage. Three different process centerings have been considered: Fast Process (circles), Typical Process (triangles) and Slow Process (squares). One decade of failure rate reduction is experimentally observed per 10mV of ageing margin, in line with our modeling framework results (lines).
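The observed slope of roughly one failure-rate decade per 10mV of margin can be inverted to estimate the margin needed to reach a given failure-rate target. The sketch below assumes that empirical slope together with a hypothetical zero-margin failure rate and target.

```python
# Invert the observed slope (about one failure-rate decade per 10 mV of ageing
# margin) to estimate the margin needed for a target. FR0 and FR_TARGET are
# assumed values used only for illustration.
import math

DECADES_PER_MV = 1.0 / 10.0     # observed failure-rate reduction slope
FR0 = 1e-2                      # failure rate at zero ageing margin (assumed)
FR_TARGET = 1e-5                # product failure-rate target (assumed)

def failure_rate(margin_mv):
    return FR0 * 10.0 ** (-DECADES_PER_MV * margin_mv)

margin_needed_mv = math.log10(FR0 / FR_TARGET) / DECADES_PER_MV
print(f"Required ageing margin: {margin_needed_mv:.0f} mV")
print(f"Check: FR({margin_needed_mv:.0f} mV) = {failure_rate(margin_needed_mv):.1e}")
```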

To conclude this part, we have proposed a robust flow allowing accurate ageing margin determination. This flow has been validated on silicon for both the 45nm SiON and 32nm Hi-K nodes using real dual-core application processors. It provides full flexibility to analyze the impact of other reliability constraints such as the mission profile and thermal runaway.

VI. MISSION PROFILE AND MAXIMUM PERFORMANCE
In advanced application processors, the generalized use of the DVFS approach generates a variety of discrete voltage-frequency (V-F) operating modes, ranging from power-driven, low-frequency modes to performance-driven, high-frequency ones. Optimizing the operating-mode conditions becomes a tricky process because of the multiple "what-if" analyses required.

In this part, we focus on the optimization of the operating conditions of the performance-driven, high-frequency mode. In this case, the goal is to maximize the frequency (and possibly increase the supply voltage) without compromising reliability.

The Safe Operating Area approach described above allows the maximum achievable performance to be determined with respect to the product failure rate target. Among the key inputs to the analysis, the amount of time spent in this mode is a first-order parameter, generally obtained either from field-monitoring datasets or from lab-based measurements [12].

Figure 10 shows how this Safe Operating Area evolves for three different times (increasing from top to bottom) spent in the most performance-demanding operating mode. The maximum achievable frequency is reduced in a non-negligible way with increasing time.



[Figure 10 plots: three panels of nominal voltage (a.u.) vs. frequency (0.7-1.5 GHz) Safe Operating Area maps.]

Figure 10: Evolution of Safe Operating Areas for a dual-core application processor in 45nm SiON node for three different stress times, from the shortest (top) to the longest (bottom). Maximum reliable performance is solely linked to the mission profile, while design and silicon performance are identical in all these cases. The maximum performance is strongly influenced by the stress time and requires adequate modeling.

Based upon these Safe Operating Areas, the optimization of the high-performance Voltage-Frequency (V-F) mode can be done in three different ways (see the sketch after the list):

- The first way is to consider that the achievable frequency for the longest stress time (for example 10 years) is sufficient for the considered application. In this case, the approach is to lower the supply voltage to minimize the power consumption to achieve this frequency for the targeted use time.

- The second way is to increase the frequency (thereby lowering the high-performance-mode use time) while keeping the supply voltage constant to keep power consumption under control.

- The third and last way is to increase both the supply voltage and the frequency to achieve the absolute maximum performance, at the cost of increased power consumption.
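Given a maximum reliable frequency Fmax(V, use time) extracted from the SOA, the three strategies reduce to different searches over the same surface. The sketch below uses a hypothetical Fmax surface, supply bounds and use times purely to make the three choices concrete; none of these numbers come from the measured processors.

```python
# Three ways to exploit the SOA for the high-performance mode, using a
# hypothetical Fmax(V, use_time) surface. All numbers are illustrative.
def f_max(v, use_time_yrs):
    """Hypothetical maximum reliable frequency (GHz) at supply v (V) for a
    given time spent in the high-performance mode."""
    return 2.0 * v - 0.1 * use_time_yrs ** 0.3

V_REF, V_MAX = 1.10, 1.20         # reference and maximum supply (assumed)
T_FULL, T_ACTUAL = 10.0, 1.0      # 10-year worst case vs. actual use time (yrs)

# 1) Keep the 10-year-safe frequency; lower V for the actual (shorter) use time.
f_required = f_max(V_REF, T_FULL)
v1 = V_REF
while f_max(v1 - 0.005, T_ACTUAL) >= f_required:
    v1 -= 0.005

# 2) Keep the voltage; take the higher frequency allowed by the shorter use time.
f2 = f_max(V_REF, T_ACTUAL)

# 3) Raise both voltage and frequency for absolute maximum performance.
f3 = f_max(V_MAX, T_ACTUAL)

print(f"1) F = {f_required:.2f} GHz at V = {v1:.3f} V  (power saving)")
print(f"2) F = {f2:.2f} GHz at V = {V_REF:.2f} V  (constant voltage)")
print(f"3) F = {f3:.2f} GHz at V = {V_MAX:.2f} V  (maximum performance)")
```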

[Figure 11 plots: processor frequency change (%) (top) and processor power change (%) (bottom) vs. equivalent use time (0.1-10 years) at 125°C, for the three optimization approaches (constant V/varying F, varying V/constant F, varying V/varying F) in both 45nm SiON and 32nm HiK nodes.]

Figure 11: Relative impact of the equivalent use time (in years) at 125°C on the processor frequency (top) and power consumption (bottom) with respect to the three different optimization approaches.

The relative impact of these three approaches on processor frequency and power consumption, as a function of high-performance-mode use time, is shown in figure 11 for both the 45nm SiON node (squares) and the 32nm HiK node (circles). Both technologies exhibit very similar behaviors, independently of the approach under consideration. In two of the three cases, the performance optimization yields an increased power consumption. This increase might limit the performance gain, either due to a battery lifetime limit or due to a thermal runaway risk.

VII. RELIABILITY AND THERMAL MANAGEMENT
The strong power consumption increase raises the risk of thermal runaway, especially in wireless applications where no dedicated cooling system can be embedded. This part addresses the basics of how this risk can be efficiently managed through our approach.

[Figure 12 plot: maximum package thermal resistance (a.u.) vs. processor power change (%); curves for "varying F, constant V" and "varying V, varying F"; thermal runaway risk bound, "V reduction at constant F" and "V and F increase" regions marked.]

Figure 12: Maximum allowed package thermal resistance versus dual-core application processor power consumption change in 32nm Hi-K node as a function of equivalent use time at 125°C. The black dotted line indicates the upper bound for thermal runaway risk.

Thermal runaway may occur in a packaged semiconductor application when the power dissipation of the device increases as a function of temperature. More particularly, it describes the situation in which no nominally steady-state operating point of the device, under the influence of the specific thermal system, can be established. Ordinarily, of course, a device that dissipates a fixed amount of power can always reach a steady-state operating condition, though the junction temperature attained may fall beyond recommended limits. If the thermal system around the device is characterized under steady-state power dissipation, this equilibrium (or steady-state) condition may be described as a maximum allowed thermal resistance. In this description, if the maximum allowed thermal resistance is greater than the actual composite thermal resistance of the packaged application, there is no risk of thermal runaway. The critical limit is thus defined solely by the composite thermal resistance of the packaged application. This situation is illustrated in figure 12.
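Numerically, the question is whether the fixed-point equation Tj = Tamb + Rth·P(Tj) settles to a solution; thermal runaway corresponds to the iteration diverging, which happens once the package thermal resistance exceeds the maximum allowed value. The sketch below assumes an exponential leakage-versus-temperature law and illustrative power and temperature numbers rather than the characterized device behavior.

```python
# Fixed-point check for thermal runaway: does Tj = Tamb + Rth * P(Tj) settle?
# The exponential leakage-vs-temperature law and all numbers are assumptions.
import math

def power(tj_c, p_dyn_w=1.5, p_leak0_w=0.3, t_ref_c=25.0, t_scale_c=40.0):
    """Total power (W): temperature-independent dynamic part plus leakage
    assumed to grow exponentially with junction temperature."""
    return p_dyn_w + p_leak0_w * math.exp((tj_c - t_ref_c) / t_scale_c)

def steady_state_tj(r_th_c_per_w, t_amb_c=45.0, max_iter=200):
    """Iterate Tj = Tamb + Rth * P(Tj); return None if it diverges (runaway)."""
    tj = t_amb_c
    for _ in range(max_iter):
        tj_next = t_amb_c + r_th_c_per_w * power(tj)
        if tj_next > 300.0:              # far beyond any plausible junction temperature
            return None                  # no steady state: thermal runaway
        if abs(tj_next - tj) < 0.01:
            return tj_next
        tj = tj_next
    return None

for r_th in (10.0, 15.0, 20.0, 25.0):
    tj = steady_state_tj(r_th)
    status = f"Tj = {tj:.1f} C" if tj is not None else "thermal runaway"
    print(f"Rth = {r_th:4.1f} C/W -> {status}")
```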

The maximum frequency increase (in conjunction with the voltage increase), though possible if only electrical reliability is considered, might ultimately be limited from a thermal runaway perspective. These results clearly indicate that both electrical reliability and thermal runaway constraints need to be taken into account jointly whenever an application processor performance optimization is put in place for wireless applications.

VIII. CONCLUSIONS
Optimizing the performance/reliability/power trade-off in commercial high-performance microprocessors for wireless applications is becoming a real challenge in advanced CMOS nodes, due to both an increased impact of electrical reliability degradation and an increased power consumption leading to an increased thermal runaway risk. Overall, this paper presents a new modeling framework to be used in conjunction with product engineering datasets to achieve optimal reliability guardband determination, even in the case of numerous, discrete V-F operating modes and their related mission profiles.

ACKNOWLEDGMENT
The authors would like to thank the STMicroelectronics and ST-Ericsson design and test teams for their outstanding support.

REFERENCES
[1] Y. Leblebici, S. M. Kang, "Hot-Carrier Reliability of MOS VLSI Circuits", Kluwer Academic Publishers, 1993.
[2] R. Thewes, et al., Microelectronics Reliability, 40 (2000), pp. 1545-1554.
[3] V. Reddy, et al., IEEE IRPS Proc. (2002), pp. 248-254.
[4] C. R. Parthasarathy, et al., Microelectronics Reliability, 46 (2006), pp. 1464-1471.
[5] V. Huard, et al., IEEE IRPS Proc. (2012).
[6] V. Reddy, et al., IEEE ITC Proc. (2004).
[7] A. Krishnan, et al., IEEE IEDM Proc. (2010).
[8] K. Van Dijk, et al., IEEE IRPS Proc. (2010).
[9] M. Wiatr, et al., IEEE IRPS Proc. (2009).
[10] Y. H. Lee, et al., IEEE IRPS Proc. (2007).
[11] K. Bowman, et al., IEEE ICCAD Proc. (2004).
[12] R. Kwasnick, et al., IEEE IRPS Proc. (2011).
