The Design and Analysis of Thermal-Resilient Hard-Real-Time Systems

Pradeep M. Hettiarachchi¹, Nathan Fisher¹, Masud Ahmed¹, Le Yi Wang², Shinan Wang¹, and Weisong Shi¹

¹Department of Computer Science
²Department of Electrical and Computer Engineering

Wayne State University
{pradeepmh, fishern, masud, lywang, shinan, weisong}@wayne.edu

Abstract—We address the challenge of designing predictable real-time systems in an unpredictable thermal environment where environmental temperature may dynamically change (e.g., implantable medical devices). Towards this challenge, we propose a control-theoretic design methodology which permits a system designer to specify a set of hard-real-time performance modes under which the system may operate. The system automatically adjusts the real-time performance mode based on the external thermal stress. We show (via analysis, simulations, and a hardware testbed implementation) that our control-design framework is stable and control performance is equivalent to previous real-time thermal approaches, even under dynamic temperature changes. A crucial and novel advantage of our framework over previous real-time control is the ability to guarantee hard deadlines even under transitions between modes. Furthermore, our system design permits the calculation of a new metric called thermal resiliency which characterizes the maximum external thermal stress that any hard-real-time performance mode can withstand. Thus, our design framework and analysis may be classified as a thermal stress analysis for real-time systems.

Index Terms—thermal resiliency; multi-mode system; thermal-aware system; thermal-aware periodic resource

I. INTRODUCTION

Modern computer-controlled systems are often deployed in dynamic and unpredictable thermal operating environments. From the hardware-design perspective, material scientists and computer engineers use rigorous thermal-stress analysis techniques (e.g., see [1]) to determine how the underlying physical hardware will withstand applied internal and external thermodynamic forces. Unfortunately, equivalent analysis does not exist for determining the effects of (unpredictable) thermal stress on the performance of the system's software. While hardware capabilities such as dynamic power management (DPM) permit a computing system to reduce its power dissipation at run-time, many embedded systems have real-time constraints which may be adversely affected by unexpected changes in processor speed.

As an example of an embedded system where thermal-stress analysis is essential, consider microprocessors found in implantable medical devices (IMDs). IMDs are increasingly being used to treat various diseases and medical conditions (e.g., pacemakers for heart disease or neural implants to restore hearing/vision). However, recent studies [2], [3] have shown that the heat dissipated from IMDs due to the microprocessor activity is non-negligible. Thus, designing IMDs with minimum thermal dissipation is critical, as medical research has shown that a temperature increase of even 1°C can have long-term effects on tissue [4] and, in the extreme, death may even result from excessive tissue heating [5]. Complicating the safe thermal design of IMDs, body temperature naturally fluctuates over time and varies depending on location [6]. An IMD designer must balance (under temperature fluctuations) the real-time computational requirements of the device with the non-harmful thermal operating limits. In the presence of an increased surrounding temperature, an IMD will have to reduce its computational load to prevent tissue damage due to heat.¹ However, as the correct and safe functioning of the IMD is an absolute requirement, the system designer requires techniques to formally verify the effect of different body temperatures on the correct operation of the IMD. Similarly, as a less safety-critical example, consider how the quality of audio/video decoding may degrade in a hand-held device as the system reacts to increases in temperature by reducing computational processing (e.g., via instruction fetch toggling). Ideally, a system designer would like to determine how much the performance will degrade under different thermal operating conditions.

Unfortunately, no current formal real-time design and analysis framework fully addresses the above setting. Recently-proposed control-theoretic frameworks exist for regulating processor temperature for soft-real-time systems (i.e., systems where jobs are permitted to "occasionally" miss computational deadlines) in an unpredictable thermal environment [8], [9]. While their results successfully show that it is possible to obtain stable and responsive thermal behavior and system utilization control, a system designer cannot use their approaches to a priori determine the amount of system-performance degradation due to changes in the thermal environment. Instead, the level of degradation can only be indirectly inferred via simulations of the system for different operating conditions. Furthermore, hard timing guarantees cannot be made in these frameworks. Techniques also already exist for permitting a trade-off between real-time QoS and processing resources (e.g., the QoS-based resource allocation model (QRAM) [10]); however, while such techniques may guarantee real-time deadlines under a fixed level of resources, they cannot guarantee deadlines when a system must dynamically switch between real-time modes (due to the uncompleted execution remaining at mode transitions). Furthermore, none of these previously-proposed techniques can be used to obtain a precise, formal quantification of the thermal stress that the system can withstand.

¹ As IMD microprocessors typically do not have DVS capabilities, an IMD may have to reduce non-essential tasks such as communication with other nodes in a body-area network [7].

In this paper, we address the challenge of determining the real-time guarantees in the presence of unpredictable dynamic environmental conditions. Towards this goal, we propose a framework and mechanisms for thermal-stress analysis in real-time systems. Our objective is to develop techniques that permit a system designer to specify, a priori, a precise quantification of the hard-real-time performance degradation due to external thermal events, via a new system design metric called real-time thermal resiliency. Informally, real-time thermal resiliency is a prediction of the maximum external operating temperature at which a specified real-time performance mode (e.g., quality-of-service) may be guaranteed in the system steady state (i.e., a time at which system properties have converged and do not change). To illustrate, consider a system with $q$ different (system designer-defined) hard-real-time performance modes $M_0, M_1, \ldots, M_q$ where modes are ordered in increasing levels of real-time performance with $M_q$ guaranteeing the highest level and $M_0$ the lowest. The real-time thermal resiliency of any mode $M_i$, denoted as $\Lambda(M_i, T_{ref})$, is the predicted maximum external operating temperature for which the system will continue to operate (in the steady state) at performance mode $M_i$ or higher and maintain a CPU reference temperature of $T_{ref}$. Furthermore, if the external temperature exceeds $\Lambda(M_i, T_{ref})$, then the system should automatically degrade to the next-lowest performance mode $M_{i-1}$. The capability to define (at system-design time) thermal-resilient, real-time performance modes allows the system designer to specify how a system will gracefully and predictably degrade under external thermal stress; furthermore, the ability to accurately determine the real-time thermal resiliency of a performance mode provides a real-time system designer with a thermal-stress analysis framework analogous to stress analysis techniques in physical sciences and engineering. In the IMD example above, the thermal-resiliency function $\Lambda$ may be used to determine (at design time) the body temperature at which a given set of tasks may safely operate without damaging surrounding tissue.

Organization. This paper presents a methodology for designing and analyzing thermal-resilient hard-real-time systems. Section II presents a high-level overview of our methodology and gives more detail on the contributions of this paper. Section III presents a brief review of previous work on thermal-aware (real-time and non-real-time) computer systems. Section IV presents the hardware, real-time, and thermal models used throughout the paper. Section V details the design of our thermal-resilient controller. Section VI derives the thermal-resiliency function $\Lambda$ for the control system. Section VII describes the results of our comparison with previous control systems via simulation and implementation upon testbed hardware. Our methodology provides formal system guarantees which require formal derivations and proofs. In the interest of space, we have deferred all formal proofs and derivations to the appendix of an extended version of this paper [11].

II. METHODOLOGY OVERVIEW

We now describe at a high level the major steps of our thermal-resilient design and analysis methodology.

1) System Hardware Specification: In the first step, the system designer must specify the processing and DPM capabilities of the system. Throughout this paper, we will be illustrating and validating our methodology upon an Intel Pentium IV 3.0 GHz single-core processor testbed. To match the rudimentary DPM capabilities often present in embedded processors, our testbed possesses the ability to only modulate the power modes of the system between active and inactive states. Section IV-A gives more detail on the hardware model and our testbed implementation details.

2) System Software Specification: The system designer must specify the set of valid software modes $M_0, M_1, \ldots, M_q$ for the system. In Section IV-B, we discuss using the sporadic task model [12] as a model for the real-time workload of each software mode.

3) Real-Time Mode Resource Allocation: After the HW/SW specification steps, the designer must determine the minimum resource allocation under which the multi-mode system is schedulable. We discuss in Section IV-B how recent techniques for schedulability analysis of hard-real-time systems where both the hardware and software change modes may be used in allocating sufficient processing time to each mode.

4) Power/Thermal Model Evaluation: Given the processing platform, we need an accurate power model in order to derive formal guarantees on the thermal resiliency of the system. Due to the duality between electrical and thermal circuits, we model the thermodynamics of our processing system using resistance/capacitance (RC) circuits. We use system identification (SI) to identify the system parameters and evaluate the efficacy of our power-model choice. Due to space constraints, the details on the derived parameters for our hardware testbed are in the appendix of the extended version of our paper [11].

5) Control System Design: We design a control structure based on optimal control theory. In this process, we use the SI parameters (determined in the previous step) to design the feedback gain parameters. We present details on our controller design in Section V.

6) System Simulation: We build a system simulator which implements the real-time scheduling algorithm and control algorithm and simulates the real-time and thermal behavior of the system based on the resource allocations and power model derived in Steps 3 and 4. The details of our simulator are provided in Section VII.

7) Thermal-Resiliency Function Calculation: Given the real-time mode resource allocation, power model, controller, and simulator observations obtained from Steps 3, 4, 5, and 6, we can obtain a quantification of the thermal-resiliency function $\Lambda$. We give details on the derivation of this function in Section VI.

8) System Validation: We finally validate our system simulator and thermal-resiliency calculations in Section VII by comparing directly with observations from our hardware testbed. Our comparison shows that the system simulator closely models the actual testbed behavior. Furthermore, we validate that our predicted thermal-resiliency function $\Lambda$ is accurate by observing that it closely tracks the actual hardware testbed behavior.

While most of the steps above are standard practice in control system design, we would like to emphasize that our ability to ensure the hard-real-time schedulability of each mode in Step 3 and obtain a priori guarantees on thermal resiliency in Step 7 distinguishes our approach from previous thermal control for real-time systems.

III. RELATED WORK

In this section, we give a brief, high-level overview of previous research in both general (non-real-time) thermal-aware system design and real-time-specific thermal-aware design. For non-real-time systems, Brooks and Martonosi [13] investigated major components of any dynamic thermal management scheme and suggested policies and mechanisms for implementing dynamic thermal management for current and future high-end CPUs. They evaluated the benefits of using dynamic thermal management to reduce the cooling system costs of CPUs and developed an architectural-level power modeling tool called Wattch. For the micro-architecture level of thermal modeling, Skadron et al. [14] proposed a compact, dynamic, and portable thermal model and a tool called HotSpot for use at the architecture level for micro-architectures.

For real-time systems in the online setting, Bansal and Pruhs [15] explored algorithms for minimizing both peak temperature and energy consumption for online jobs with deadline constraints. In the off-line setting, previous work on scheduling under thermal constraints has followed two main approaches: reactive and proactive schedulers. In a reactive scheduler, the processor speed is reduced in response to a thermal trigger. Wang et al. [16] studied schedulability analysis under the reactive setting. In the proactive setting, the speed schedule for the processor is determined at design time. Chen et al. [17] addressed proactive scheduling for the periodic task model. Quan and Zhang [18] considered feasibility analysis of leakage-aware periodic tasks under temperature constraints. However, previous work in both settings assumed either simple task models or the existence of "ideal" processor speeds. Our proposed control framework may be considered a proactive scheduler; however, we attempt to remove some ideal assumptions by working with only two power modes and the more general sporadic task model. Also, we consider ambient temperature changes and analyze the effects of their variation on the task system. Recent dynamic temperature management strategies also exist for multiprocessor real-time systems [19]–[21]; however, most of these focus upon static speed-assignment approaches and not a proactive schedule. Thermal analysis has also been studied in the context of web servers [22], but hard deadlines are not guaranteed. As mentioned in the introduction, work by Y. Fu et al. [8] and X. Fu et al. [9] addresses handling unpredictable thermal events; however, the results do not provide any a priori guarantees that may be used to equate real-time performance and thermal resiliency.

IV. MODELS

A. System Hardware Model and Testbed

For this paper, we consider a single-processor system with rudimentary DPM capabilities of only active and inactive power modes. At any time $t > 0$, we denote the instantaneous CPU power as $P_{cpu}(t)$. The processor dissipates thermal power at a constant rate $P_{cpu}(t) = P_{act}$ in the active mode and $P_{cpu}(t) = P_{inc}$ in the inactive mode. Also, we assume that the processor consumes $e_{act}$ units of energy to activate from the inactive mode and $e_{inc}$ units of energy to deactivate from the active mode. Even though the processor may be minimally active while in the low-power state, we will assume (as a pessimistic assumption for the purpose of schedulability analysis) that the processor is unavailable for task execution while in that state. If the aforementioned assumption does not hold, the system will behave "better" than the analysis and our results will continue to be valid. We believe this model of active/inactive modes is a very general model, applicable to a large number of available embedded processors with rudimentary DPM capabilities. For ideal processors with continuous power modes, $P_{cpu}(t)$ may be selected from the range $[0, P_{act}]$.

Our control system for the active/inactive processor will enforce strict periodic mode changes. For this purpose, we employ a recently proposed thermal-aware periodic resource [23] model, which is an extension of the well-known periodic resource model proposed by Shin and Lee [24] for compositional real-time systems. In the thermal-aware periodic resource model, the processing resource is characterized with a two-tuple $(\Pi, \Theta)$. The parameter $\Pi$ is called the resource period and $\Theta$ is called the resource capacity. We will assume that $\Pi$ is a non-negative integer (likely subject to the system tick granularity). The interpretation is that the processor will be active for $\Theta$ amount of time at the beginning of each successive $\Pi$-length interval. The ratio $\Theta/\Pi$ is called the resource bandwidth. Within each processor allocation, an arbitrary uniprocessor scheduling algorithm (e.g., EDF or RM) may be employed to schedule the underlying task system (see next subsection). See Figure 1 for an illustration of the thermal-aware periodic resource.

As a case study of our methodology, we have built a hardware testbed using an Intel Pentium IV 3.0 GHz single-core processor running a modified Linux kernel (2.6.33.7.2-rt30 PREEMPT_RT). We have developed device drivers to activate the CPU modulation from user space using Model Specific Registers (MSR) of the processor to create a high- and low-frequency modulation for active/inactive power states. For obtaining the system temperature, we follow the official procedure given by Intel [25] to install a thermal sensor to measure the die temperature with the best possible accuracy. A Phidgets four-port thermal sensor with ambient temperature sensor was used to measure the air and environment temperature. A detailed description of the testbed is given in the extended version of this paper [11].
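The thermal-aware periodic resource $(\Pi, \Theta)$ described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PeriodicResource`, its method names, and the choice to measure time from the start of the first period are hypothetical and are not taken from the paper or its testbed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicResource:
    """Illustrative model of a thermal-aware periodic resource (Pi, Theta)."""
    period: int      # Pi: resource period, in system ticks
    capacity: float  # Theta: active time at the start of each period

    def __post_init__(self):
        if self.period <= 0 or not 0 <= self.capacity <= self.period:
            raise ValueError("need period > 0 and 0 <= capacity <= period")

    @property
    def bandwidth(self) -> float:
        """Resource bandwidth Theta / Pi."""
        return self.capacity / self.period

    def is_active(self, t: float) -> bool:
        """True if the processor is active at time t >= 0.

        Assumes period boundaries fall at 0, Pi, 2*Pi, ... and that the
        processor is active during the first Theta units of each period.
        """
        return (t % self.period) < self.capacity

    def supply(self, t: float) -> float:
        """Total active time delivered over the interval [0, t]."""
        full_periods, remainder = divmod(t, self.period)
        return full_periods * self.capacity + min(remainder, self.capacity)
```

For example, a resource with $\Pi = 10$ and $\Theta = 4$ has bandwidth 0.4, is active on $[0, 4)$, $[10, 14)$, and so on, and delivers 12 units of processing time over $[0, 25]$.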

B. System Software Model

In the introduction, we proposed a system model of real-time performance modes $M_1, \ldots, M_q$. For the purpose of this paper, we will assume each performance mode $M_i$ is characterized by a sporadic task system² [12] with $n_i$ tasks and the resource capacity $\Theta^{(i)}$. That is, $M_i = \left(\{\tau_1^{(i)}, \tau_2^{(i)}, \ldots, \tau_{n_i}^{(i)}\}, \Theta^{(i)}\right)$ where each $\tau_j^{(i)} \in M_i$ is a sporadic task characterized by a three-tuple $(e_j^{(i)}, d_j^{(i)}, p_j^{(i)})$ and $\Theta^{(i)}$ is the minimum capacity required to meet the deadlines of the tasks of $M_i$. (Note that we are abusing notation by allowing $M_i$ to represent both the set of tasks and the two-tuple of the mode's task system and required resource capacity.) In this three-tuple representation for a task, $e_j^{(i)}$ is the worst-case execution requirement, $d_j^{(i)}$ is the relative deadline, and $p_j^{(i)}$ is the minimum inter-arrival separation parameter (historically called the "period"). A sporadic task $\tau_j^{(i)}$ may produce a (potentially infinite) sequence of jobs, where each job has an execution requirement of $e_j^{(i)}$ time units and must complete within $d_j^{(i)}$ time units after its arrival. The first job of $\tau_j^{(i)}$ may arrive at any time after system-start time; however, successive jobs of $\tau_j^{(i)}$ must arrive at least $p_j^{(i)}$ time units apart. For this paper, we assume that the resource period $\Pi$ is identical in all modes. For mode $M_i$, a resource capacity of $\Theta^{(i)}$ is provided every resource period. Figure 1 illustrates the processing-time allocation in two different modes.
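The sporadic task parameters above map naturally onto a small data structure. The following Python sketch is illustrative only: the names `SporadicTask`, `legal_arrivals`, `utilization`, and `passes_bandwidth_check` are hypothetical, and the utilization bound shown is the standard necessary (not sufficient) condition that long-run demand cannot exceed the resource bandwidth $\Theta^{(i)}/\Pi$. It is not the schedulability test the paper relies on.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SporadicTask:
    wcet: float      # e: worst-case execution requirement
    deadline: float  # d: relative deadline
    period: float    # p: minimum inter-arrival separation

    def legal_arrivals(self, arrivals: Sequence[float]) -> bool:
        """True if successive job arrivals are at least p time units apart."""
        return all(b - a >= self.period
                   for a, b in zip(arrivals, arrivals[1:]))

    def absolute_deadlines(self, arrivals: Sequence[float]) -> List[float]:
        """Each job must complete within d time units of its arrival."""
        return [a + self.deadline for a in arrivals]


def utilization(tasks: Sequence[SporadicTask]) -> float:
    """Worst-case long-run processor demand of the task system."""
    return sum(t.wcet / t.period for t in tasks)


def passes_bandwidth_check(tasks: Sequence[SporadicTask],
                           capacity: float, resource_period: float) -> bool:
    """Necessary condition only: utilization must not exceed Theta / Pi."""
    return utilization(tasks) <= capacity / resource_period
```

A mode whose tasks fail `passes_bandwidth_check` cannot be schedulable at that capacity; a mode that passes still needs a full schedulability test before its $\Theta^{(i)}$ can be trusted.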

We will assume that there is an ordering of real-time performance modes based on their "computational requirements" to meet all of a mode's deadlines. The relation $M_i \succeq M_j$ indicates that $M_i$ is at least as computationally intensive as $M_j$. For notational convenience, we will assume that mode $M_0$ represents the mode with no tasks and $\Theta^{(0)}$ equal to zero. Furthermore, for this paper, we assume that the modes are well-ordered and have been indexed in increasing order of computational requirements; i.e., $M_0 \preceq M_1 \preceq M_2 \preceq \ldots \preceq M_q$. While there are many possible ways to define the $\preceq$ relation, the only ordering required from the perspective of our thermal control is that $M_i \preceq M_j$ if and only if $\Theta^{(i)} \leq \Theta^{(j)}$; i.e., to reduce the temperature of the system, we need to decrease the processing-time allocation.

² Note that we assume the sporadic task model throughout, but the results could be extended to other task models without much change.

[Figure 1: timeline of resource periods of length $\Pi$, each beginning with an activation block of length $\Theta^{(i)}$; after each marked mode change the activation length switches to $\Theta^{(j)}$.]

Fig. 1: The sampling and mode change in our thermal control system. The blocks indicate time periods during which the processor is active under the thermal-aware periodic resource model. Sporadic tasks are scheduled within the activation blocks.
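To make the mode ordering and the role of the thermal-resiliency function $\Lambda$ concrete, the sketch below orders modes by $\Theta^{(i)}$ and looks up the highest mode whose resiliency still covers a given external temperature. It is a steady-state illustration only, not the feedback controller of Section V: the `Mode` class, the `resiliency` field, and every numeric value are hypothetical placeholders, and the lookup assumes that resiliency does not increase with capacity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Mode:
    name: str
    capacity: float    # Theta^(i): processing time per resource period
    resiliency: float  # assumed Lambda(M_i, T_ref), in degrees Celsius


def order_modes(modes: Sequence[Mode]) -> List[Mode]:
    """Index modes in increasing order of required capacity (M_0 first)."""
    return sorted(modes, key=lambda m: m.capacity)


def select_mode(modes: Sequence[Mode], external_temp: float) -> Mode:
    """Return the highest mode whose resiliency covers external_temp.

    Falls back to the lowest mode when no resiliency value is high enough.
    """
    ordered = order_modes(modes)
    for mode in reversed(ordered):
        if external_temp <= mode.resiliency:
            return mode
    return ordered[0]


# Hypothetical values, for illustration only.
MODES = [
    Mode("M2", capacity=5.0, resiliency=30.0),
    Mode("M0", capacity=0.0, resiliency=float("inf")),
    Mode("M1", capacity=2.0, resiliency=40.0),
]
```

With these placeholder values, an external temperature of 25°C selects M2, 35°C degrades the system to M1, and 45°C leaves only the empty mode M0.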

Our model does not require any particular mode-change semantics to be adopted. Some potential options for dealing with incompletely-executed jobs upon a mode change are: (i) aborting any incomplete jobs; (ii) delaying the release of jobs in the new mode until all jobs of the old mode have completed; and (iii) allowing jobs of the new mode to be released, as soon as legally allowable, while jobs of the old mode are still active. For the purposes of our hardware testbed and simulations (Section VII), we assume option (iii).

The scheduling of real-time performance mode $M_i$ upon the thermal-aware periodic resource may be done by any uniprocessor real-time scheduling algorithm (e.g., earliest-deadline-first or rate-monotonic [26]). However, $\Theta^{(i)}$ must be sufficiently large for the scheduling algorithm to correctly schedule all jobs of the task set of $M_i$ (i.e., $\{\tau_1^{(i)}, \tau_2^{(i)}, \ldots, \tau_{n_i}^{(i)}\}$) and (potentially) any jobs from the previous mode that have not completed by the mode change. To obtain a proper resource allocation, $\Theta^{(i)}$, for each mode, we use our recently-developed hard-real-time schedulability test (for EDF scheduling under hardware/software mode changes in the periodic resource model) to search for a safe value of $\Theta^{(i)}$ for each mode [27] to ensure that deadlines are always met.

C. Power/Thermal Model

We use the duality principle in electrical and thermal circuits to describe the dynamics of the power-dissipating source using electrical resistance/capacitance (RC) circuits. Figure 2 shows the basic equivalent circuit for the CPU and its surrounding environment. We assume that the total dissipated power of the CPU, $P_{cpu}$, is equal to the sum of the power due to dynamic current $P_{cpu}^{d}$ and the power due to leakage current $P_{cpu}^{l}$. Furthermore, we assume that the temperature-dependent leakage power may be closely approximated by a linear function of CPU temperature [28].

Let $V_{cpu}(t)$, $V_{env}(t)$, and $V_{air}(t)$ represent the equivalent voltages for equivalent temperatures of the CPU, environment, and air (room) respectively. Let $T_{cpu}$ be the instantaneous relative temperature of the CPU with respect to the immediate environment (e.g., CPU casing), $T_{env}$ be the relative temperature of the immediate environment with respect to the room air temperature, and $T_{air}$ be the (absolute) room air temperature. For example, if $T_{air}$ is 20°C, $T_{env}$ is 10°C, and $T_{cpu}$ is 15°C, then the absolute temperature of the CPU is 45°C.

[Figure 2: equivalent RC circuit with node voltages $V_{cpu}(t) = T_{cpu}(t)$, $V_{env}(t) = T_{env}(t)$, and $V_{air}(t) = T_{air}(t)$, driven by the power sources $P_{cpu}^{d}$, $P_{cpu}^{l}$, and $P_{env}$.]

Fig. 2: The basic equivalent circuit for a working CPU and its working environment.

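Let $P_{cpu}^{d}(t)$, $P_{cpu}^{l}(t)$, and $P_{env}(t)$ represent, respectively, the dynamic CPU, leakage CPU, and environment power dissipation. Let $R_{cpu}^{d}$, $R_{cpu}^{l}$, $R_{env}$, $C_{cpu}^{d}$, $C_{cpu}^{l}$, and $C_{env}$ represent the dynamic and leakage thermal resistance, environment resistance, CPU dynamic and leakage capacitance, and environment capacitance. Finally, let $\sigma_1 \stackrel{\text{def}}{=} \frac{1}{C_{cpu}^{d} + C_{cpu}^{l}}$ and let $k_T$ and $k_C$ represent processor-dependent constants used in approximating the temperature-dependent leakage current. Applying Kirchhoff's circuit laws, we get the following equations for $T_{cpu}(t)$,

$$\frac{T_{cpu}(t)}{R_{cpu}^{d}} + C_{cpu}^{d}\,\frac{d}{dt}T_{cpu}(t) = P_{cpu}^{d}(t) \tag{1}$$

$$\frac{T_{cpu}(t)}{R_{cpu}^{l}} + C_{cpu}^{l}\,\frac{d}{dt}T_{cpu}(t) = P_{cpu}^{l}(t) = k_T\bigl(T_{cpu}(t) + T_{env}(t)\bigr) + k_C. \tag{2}$$

Adding (1) and (2) and solving for $\frac{d}{dt}T_{cpu}(t)$,

$$\frac{d}{dt}T_{cpu}(t) = \sigma_1\left(k_T - \frac{1}{R_{cpu}^{l}} - \frac{1}{R_{cpu}^{d}}\right)T_{cpu}(t) + k_T\sigma_1 T_{env}(t) + \sigma_1 P_{cpu}^{d}(t) + \sigma_1 k_C. \tag{3}$$

We obtain the following equation for $T_{env}(t)$,

$$\frac{T_{env}(t)}{R_{env}} + C_{env}\,\frac{d}{dt}T_{env}(t) = P_{cpu}(t) + P_{env}(t) = P_{cpu}^{d}(t) + P_{cpu}^{l}(t) + P_{env}(t). \tag{4}$$

Substituting the leakage approximation from (2) into (4) and solving for $\frac{d}{dt}T_{env}(t)$,

$$\frac{d}{dt}T_{env}(t) = \frac{k_T}{C_{env}}T_{cpu}(t) + \frac{1}{C_{env}}P_{cpu}^{d}(t) + \frac{1}{C_{env}}P_{env}(t) + \left(\frac{k_T}{C_{env}} - \frac{1}{R_{env}C_{env}}\right)T_{env}(t) + \frac{k_C}{C_{env}}. \tag{5}$$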
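Equations (3) and (5) form a pair of coupled linear differential equations, so their behavior can be explored numerically. The following sketch integrates them with the forward-Euler method for constant power inputs. It is a minimal illustration under stated assumptions: the function name `simulate`, the dictionary keys, and all parameter values are hypothetical, and the values are not the identified parameters of the paper's testbed.

```python
def simulate(p_dyn, p_env, t_cpu0, t_env0, params, dt, steps):
    """Forward-Euler integration of Equations (3) and (5).

    p_dyn and p_env are held constant over the whole run; the returned
    temperatures are relative, as defined for T_cpu and T_env.
    """
    r_d, r_l, r_env = params["R_d"], params["R_l"], params["R_env"]
    c_d, c_l, c_env = params["C_d"], params["C_l"], params["C_env"]
    k_t, k_c = params["k_T"], params["k_C"]
    sigma1 = 1.0 / (c_d + c_l)

    t_cpu, t_env = t_cpu0, t_env0
    for _ in range(steps):
        # Equation (3): derivative of the relative CPU temperature.
        d_cpu = sigma1 * ((k_t - 1.0 / r_l - 1.0 / r_d) * t_cpu
                          + k_t * t_env + p_dyn + k_c)
        # Equation (5): derivative of the relative environment temperature.
        d_env = ((k_t * t_cpu + p_dyn + p_env + k_c) / c_env
                 + (k_t / c_env - 1.0 / (r_env * c_env)) * t_env)
        t_cpu += dt * d_cpu
        t_env += dt * d_env
    return t_cpu, t_env


# Hypothetical parameter values, chosen only to give a stable example.
EXAMPLE_PARAMS = {"R_d": 2.0, "R_l": 2.0, "R_env": 0.5,
                  "C_d": 0.5, "C_l": 0.5, "C_env": 1.0,
                  "k_T": 0.0, "k_C": 0.0}
```

With the leakage constants set to zero, as in `EXAMPLE_PARAMS`, the two equations decouple and the simulation settles at $T_{cpu} = P_{cpu}^{d} / (1/R_{cpu}^{l} + 1/R_{cpu}^{d})$ and $T_{env} = R_{env}\,(P_{cpu}^{d} + P_{env})$, which gives a quick sanity check on the step size.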

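If we know the temperature of the environment and CPU at some initial time $t_0 \leq t$, then we can derive the following equations from (3) and (5):

$$T_{cpu}(t) = \int_{t_0}^{t} \sigma_1 P_{cpu}(s)\,e^{-(t-s)\beta_1}\,ds + T_{cpu}(t_0)\,e^{-(t-t_0)\beta_1}, \tag{6}$$

$$T_{env}(t) = \int_{t_0}^{t} \sigma_2\bigl(P_{env}(s) + P_{cpu}(s)\bigr)e^{-(t-s)\beta_2}\,ds + T_{env}(t_0)\,e^{-(t-t_0)\beta_2}, \tag{7}$$

where $\beta_1 \stackrel{\text{def}}{=} \left(\frac{1}{R_{cpu}^{d}} + \frac{1}{R_{cpu}^{l}} - k_T\right)\cdot\frac{1}{C_{cpu}^{d} + C_{cpu}^{l}}$, $\beta_2 \stackrel{\text{def}}{=} \frac{1}{R_{env}C_{env}} - \frac{k_T}{C_{env}}$, and $\sigma_2 \stackrel{\text{def}}{=} \frac{1}{C_{env}}$. According to Figure 2, the absolute CPU temperature can be calculated as $T_{cpu}(t) + T_{env}(t) + T_{air}(t)$.

[Figure 3: block diagram of the control loop with blocks $z^{-1}$, $C$, $G$, $K$, $H$, an integrator with gain $\gamma_I$, and the constant input $f$; signals $v_e(k)$, $x(k)$, $y(k)$, $u(k)$, and the reference $T_{ref}(k) - T_{air}(k)$.]

Fig. 3: The thermal control design with state feedback and integral actuator.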
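When the power input is constant over an interval, the integral in Equation (6) has a closed form: $T(t) = \frac{\sigma P}{\beta}\bigl(1 - e^{-\beta(t - t_0)}\bigr) + T(t_0)\,e^{-\beta(t - t_0)}$. The sketch below uses this to step the temperature across the alternating active and inactive intervals of a periodic resource. The function names and the argument order are hypothetical, and the code assumes $\beta > 0$ and constant power levels $P_{act}$ and $P_{inc}$ within each interval.

```python
import math


def step_temperature(temp0, power, sigma, beta, elapsed):
    """Evaluate Equation (6) for constant power over an interval.

    temp0 is the temperature at the start of the interval and elapsed is
    the interval length t - t0; requires beta > 0.
    """
    decay = math.exp(-beta * elapsed)
    return sigma * power / beta * (1.0 - decay) + temp0 * decay


def temperature_after_periods(temp0, p_act, p_inc, sigma, beta,
                              period, capacity, n_periods):
    """Apply the closed form across n_periods of an active/inactive cycle.

    Each period is active at power p_act for `capacity` time units, then
    inactive at power p_inc for the remaining `period - capacity` units.
    """
    temp = temp0
    for _ in range(n_periods):
        temp = step_temperature(temp, p_act, sigma, beta, capacity)
        temp = step_temperature(temp, p_inc, sigma, beta, period - capacity)
    return temp
```

Because each step is exact for constant power, no integration step size has to be chosen; the same pattern applies to Equation (7) with $\sigma_2$, $\beta_2$, and the summed power $P_{env} + P_{cpu}$.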

V. CONTROLLER DESIGN

We first present the standard state-space model used in control theory in Section V-A. In Section V-B, we design a thermal controller assuming an ideal system with continuous power modes. In Section V-C, we will extend the controller design to a processor with only active/inactive power modes.

A. State-Space Model Basics

We use the standard state-space model to represent the continuous-time (ideal) system

$$\begin{aligned} \dot{x}(t) &= Ax(t) + Bu(t) + f, \\ y(t) &= Cx(t), \end{aligned} \tag{8}$$

where $x(t)$, $u(t)$, and $y(t)$ represent the state vector, the input vector, and the output vector, respectively. $A$, $B$, and $C$ represent the system matrices and $f$ represents a constant vector. Both the state matrices and the constant vector are time-invariant quantities.

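Since we have a computer-controlled discrete-time system, we will use the following state-space model for the discrete-time controller for active/inactive modes. For a sampling interval $T_s$, $u(t)$ is constant within each interval and the sampled system of Equation (8) is

$$\begin{aligned} x((k+1)T_s) &= Gx(kT_s) + Hu(kT_s) + f, \\ y(kT_s) &= Cx(kT_s), \end{aligned} \tag{9}$$

where $G = e^{AT_s}$, $H = \int_0^{T_s} e^{At}B\,dt$, $C = C$, and $f = \int_0^{T_s} e^{At}f\,dt$; with a slight abuse of notation, $C$ and $f$ on the left-hand sides denote the discrete-time quantities and those on the right-hand sides the continuous-time quantities of Equation (8). The term $e^{At}$ can be computed by $\mathcal{L}^{-1}\{(sI - A)^{-1}\}$, where $\mathcal{L}^{-1}$ is the inverse Laplace transform. In the remainder of the document, we abuse the notation by representing $x(kT_s)$ as $x(k)$, $x((k+1)T_s)$ as $x(k+1)$, $u(kT_s)$ as $u(k)$, and $y(kT_s)$ as $y(k)$. The above definitions may be found in any textbook on discrete-time control theory [29].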
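The sampled-system matrices $G = e^{AT_s}$ and $H = \int_0^{T_s} e^{At}B\,dt$ can also be computed numerically rather than through the inverse Laplace transform. The sketch below swaps in a truncated power series, using $e^{AT_s} = \sum_{k \ge 0} \frac{(AT_s)^k}{k!}$ and $\int_0^{T_s} e^{At}\,dt = \sum_{k \ge 0} \frac{A^k T_s^{k+1}}{(k+1)!}$. It is an illustrative pure-Python implementation: the function names and the `terms` cutoff are assumptions, and a numerical library routine such as `scipy.linalg.expm` would be the usual choice in practice.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def discretize(a, b, ts, terms=30):
    """Return (G, H) for sampling interval ts via truncated power series.

    a is an n-by-n matrix and b an n-by-m matrix, both as lists of rows.
    The series converges quickly when the entries of a * ts are small.
    """
    n = len(a)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    g = [row[:] for row in identity]            # running sum for e^{A ts}
    s = [[ts * x for x in row] for row in identity]  # running sum for the integral
    term = [row[:] for row in identity]         # holds (A ts)^k / k!
    for k in range(1, terms):
        term = [[x * ts / k for x in row] for row in mat_mul(term, a)]
        for i in range(n):
            for j in range(n):
                g[i][j] += term[i][j]
                s[i][j] += term[i][j] * ts / (k + 1)
    return g, mat_mul(s, b)
```

For a diagonal $A$ the result can be checked by hand: with $A = \mathrm{diag}(-1, -2)$ and $T_s = 0.1$, the diagonal of $G$ is $(e^{-0.1}, e^{-0.2})$. Multiplying the same integral matrix by the constant vector of Equation (8) gives the discrete-time constant vector of Equation (9).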

B. Continuous Power Modes

As a first step towards our goal of designing a control-theoretic framework for thermal stress analysis, we employ linear quadratic (LQ) optimal control for real-time thermal management. Our design consists of an optimal state feedback and a servo that regulates the dynamics of the system. An LQ controller enables us to design an efficient and low-overhead controller, derive the feedback parameters before runtime (used in thermal-resiliency analysis), and smoothly track our reference input. In the future, we plan on applying more complex and robust controllers (e.g., $H_\infty$ controllers) to decrease the controller's sensitivity to modeling inaccuracy and noise. However, as observed in the simulations and experiments of Section VII, our current LQ design is appropriately responsive to changes in environmental temperature.

In our system model, we specify the thermal power of the CPU as the control input to the system. The controller is required to work as a servo and should follow the temperature reference, $T_{ref}$. In our design, we consider $T_{cpu}(t)$ as one of the variables to be controlled and $P^d_{cpu}(t)$ as the manipulated variable (equivalent to $y(t)$ and $u(t)$, respectively, in the continuous state-space model). The basic control structure is given in Figure 3.

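From Equations (3) and (5), the continuous-time state-space model can be written as

$$\begin{bmatrix} \dot{T}_{cpu}(t) \\ \dot{T}_{env}(t) \end{bmatrix} = \begin{bmatrix} -\beta_1 & k_T\sigma_1 \\ k_T\sigma_2 & -\beta_2 \end{bmatrix} \begin{bmatrix} T_{cpu}(t) \\ T_{env}(t) \end{bmatrix} + \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix} P^d_{cpu}(t) + \begin{bmatrix} 0 \\ \sigma_2 \end{bmatrix} P_{env}(t). \tag{10}$$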

While our analysis below is in the continuous-time domain, a discrete-time control system approach would be applied in an actual computer implementation. Therefore, we note that we may easily convert the continuous state-space model to the discrete-time sampled system, $x(k+1) = Gx(k) + Hu(k) + f$, from the continuous-time state matrices

$$A = \begin{bmatrix} -\beta_1 & k_T\sigma_1 \\ k_T\sigma_2 & -\beta_2 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix},$$

where $k$ is the sampling index, $T_s$ is the sampling interval, and $G$ and $H$ can be calculated as described in Section V-A. For our given system, $x(k) \equiv \begin{bmatrix} T_{cpu}(k) \\ T_{env}(k) \end{bmatrix}$ and $u(k) \equiv \begin{bmatrix} P_{cpu}(k) \end{bmatrix}$, where we are again abusing notation for the $T$ and $P$ functions.

To eliminate steady-state tracking error, we design our system as a servo with an integrator. Define an additional error vector $v_e(t)$ in continuous time as,

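$$v_e(t) \stackrel{def}{=} \int_0^t \big(T_{ref} - T(s) - T_{air}(s)\big)\, ds,$$

$$\begin{aligned} \dot{v}_e(t) &\stackrel{def}{=} T_{ref} - T_{air}(t) - T(t) \\ &= -C \begin{bmatrix} T_{cpu}(t) \\ T_{env}(t) \end{bmatrix} + T_{ref} - T_{air}(t), \end{aligned} \tag{11}$$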

where $C = [1, 1]$. Then, the system input is calculated with a gain $K_o = [\gamma_1, \gamma_2]$ and integral constant $\gamma_I$ in the following equation.

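$$\begin{aligned} P^d_{cpu}(t) &= -K_o \begin{bmatrix} T_{cpu}(t) \\ T_{env}(t) \end{bmatrix} + \gamma_I v_e(t) \\ &= -\big(\gamma_1 T_{cpu}(t) + \gamma_2 T_{env}(t)\big) + \gamma_I \int_0^t \big(T_{ref} - T_{air}(s) - T_{cpu}(s) - T_{env}(s)\big)\, ds. \end{aligned} \tag{12}$$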

We employ standard techniques from optimal control theory to derive $K_o$ and $\gamma_I$ and prove stability. Details are presented in an appendix of an extended version of this paper [11].

C. Active/Inactive Power Modes

Since the CPU power cannot be varied continuously, the controller designed in the previous section cannot be directly applied to the setting of discrete active/inactive power modes. In this section, we extend the design of the continuous power modes controller to the active/inactive power mode setting by applying pulse-width modulation (PWM) techniques. Recall from Section IV that the active/inactive power modes are modeled via the thermal-aware periodic resource model with parameters $\Pi$ and $\Theta$. Thus, to control the system via this model, we must choose appropriate values of $\Pi$ and $\Theta$. The $\Pi$ value is a design parameter which may be chosen at controller design time and is assumed fixed throughout controller execution. Typically, a smaller value of $\Pi$ will increase the system schedulability; however, a larger value of $\Pi$ will decrease the overhead potentially incurred by switching between the active and inactive power modes. (See Ahmed et al. [23] for algorithms for determining $\Pi$ in the thermal setting.) The only constraint that our framework places on the chosen value of $\Pi$ is that it must evenly divide the sampling interval length $T_s$ (i.e., $T_s = \kappa\Pi$ for some $\kappa \in \mathbb{N}^+$).

Since we have only two power modes, we cannot arbitrarily set the power level. However, we may change the assigned resource capacity between sampling periods to approximate arbitrary power levels. Therefore, the assigned resource capacity will be the manipulated variable in our PWM system. Let $\Theta(k)$ denote the value of the resource capacity over the $k$'th sampling period. For determining the $\Theta(k)$ value, we use a method based on the principle of equivalent areas (PEA) for converting any arbitrary input signal into an equivalent PWM signal [30]. First, note that in a discrete-time system using zero-order hold (ZOH), the input signal is held constant over the sampling period. Specifically, for the $k$'th sampling interval, the input $P^d_{cpu}(k)$ is held over the $T_s$-length interval, resulting in a total energy dissipation of $T_s \cdot P^d_{cpu}(k)$ over the interval. To get the equivalent area (i.e., energy) as the (ideal) system with continuous power modes, we must set $\Theta(k)$ such that the periodic modulations between the power modes of $P_{act}$ and $P_{inc}$ dissipate the equivalent amount of energy over the $T_s$-length interval. Figure 4 illustrates the area equivalence between the continuous and PWM controllers. The appendix of the extended version of this paper [11] describes how to choose $T_s$ to minimize error due to the PWM approximation.

[Figure 4: two timelines over the $k$th, $(k+1)$th, and $(k+2)$th samples. One shows the held power levels $P(k)$, $P(k+1)$, and $P(k+2)$; the other shows PWM pulses of width $\Theta(k)$ and $\Theta(k+1)$ repeated with period $\Pi$.]

Fig. 4: The simplified power and modulation relationship

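More formally, we may derive the following relationship between $P^d_{cpu}(k)$ and $\Theta(k)$,

$$\kappa \Pi P^d_{cpu}(k) = \kappa \left( e_{act} + \int_0^{\Theta(k)} P_{act}\, dt + e_{inc} + \int_{\Theta(k)}^{\Pi} P_{inc}\, dt \right)$$

$$\Rightarrow \quad P^d_{cpu}(k) = \left( \frac{P_{act} - P_{inc}}{\Pi} \right) \Theta(k) + P_{inc} + \frac{1}{\Pi}\,(e_{act} + e_{inc}). \tag{13}$$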
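The relationship in Equation (13) can be sketched in both directions as below. The function names are hypothetical, and the overhead terms $e_{act}$ and $e_{inc}$ default to zero purely as a simplifying assumption for the example; the power and period values come from Table I.

```python
def power_from_capacity(theta, p_act, p_inc, period, e_act=0.0, e_inc=0.0):
    """Eq. (13): mean power over one PWM period with active time theta."""
    return (p_act - p_inc) / period * theta + p_inc + (e_act + e_inc) / period

def capacity_from_power(power, p_act, p_inc, period, e_act=0.0, e_inc=0.0):
    """Eq. (13) solved for theta: the capacity that yields a given mean power."""
    overhead = (e_act + e_inc) / period
    return (power - p_inc - overhead) * period / (p_act - p_inc)

# Table I values: P_act = 73 W, P_inc = 20 W, PWM period = 20 ms.
theta_ms = capacity_from_power(46.5, p_act=73.0, p_inc=20.0, period=20.0)
```

With these values, a requested mean power halfway between $P_{inc}$ and $P_{act}$ (46.5 W) corresponds to a 50% duty cycle, i.e., a capacity of 10 ms per 20 ms period.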

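Algorithm 1 Control Algorithm

Require: Reference Temperature $T_{ref}$; Feedback Gain $K \equiv [\gamma_1, \gamma_2]$; Integral Constant $\gamma_I$; PWM Period $\Pi$; Number of PWM periods in a sampling period $\kappa$.

 1: while At beginning of sampling period $[t_\ell, t_{\ell+1})$ : $t_\ell \equiv \kappa\ell\Pi$ do
 2:    Sample $T_{cpu}(t_\ell) + T_{env}(t_\ell) + T_{air}(t_\ell)$.
 3:    $v_e(t_\ell) = T_{ref} - \big(T_{cpu}(t_\ell) + T_{env}(t_\ell) + T_{air}(t_\ell)\big)$
 4:    $Tot\_v_e(t_\ell) = Tot\_v_e(t_{\ell-1}) + \gamma_I \kappa \Pi\, \frac{v_e(t_\ell) + v_e(t_{\ell-1})}{2}$
 5:    $P_{cpu}(t_\ell) = Tot\_v_e(t_\ell) - \big(\gamma_1 T_{cpu}(t_\ell) + \gamma_2 T_{env}(t_\ell)\big)$
 6:    $\Theta(t_\ell) = \min\left( \Pi \times \frac{P_{cpu}(t_\ell) - P_{inc}}{P_{act} - P_{inc}},\ \Pi \right)$
 7:    $i = \max\{ j \in \mathbb{Z}_{q+1} \mid \Theta^{(j)} \le \Theta(t_\ell) \}$
 8:    Update real-time performance mode to $M_i$.
 9:    Set PWM to operate at period of $\Pi$ and width of $\Theta(t_\ell)$.
10: end while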

The PWM controller pseudocode is presented in Algorithm 1. The controller proposed here consists of two integrated operations: the thermal controller and the PWM modulator. The first step is to sample the CPU temperature (Line 2 of Algorithm 1). The error is then calculated by taking the difference between the reference temperature and the CPU temperature (Line 3). In the next line, the error is integrated over the sampling period (using the average of the current and previous errors) and added to the running sum of the integrated error (Line 4). After that, the power input is calculated (Line 5) and the equivalent $\Theta$ is calculated from the property of Equation (13) (Line 6). Finally, the appropriate mode is selected (Line 7), the mode change is performed (Line 8), and the pulse-width modulator is invoked for the next $\kappa$ $\Pi$-length intervals (Line 9). It is important to note that $\Theta(t_\ell)$ calculated in Line 6 does not have to equal the $\Theta^{(j)}$ for the selected mode; we must only select the highest mode with $\Theta^{(j)} \le \Theta(t_\ell)$. (If $\Theta(t_\ell)$ is larger, we are only giving the mode more processing than it requires.) It should also be observed that all operations, except for finding the appropriate mode, may be done in $O(1)$ time. Finding the highest real-time performance mode that may execute can be done in $O(\lg q)$ time (via binary search), where $q$ is the number of real-time performance modes.
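Lines 3 through 7 of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the testbed implementation: the names `ControllerState` and `control_step` are hypothetical, the list of per-mode capacities is assumed to be sorted in ascending order with entry 0 equal to zero, and the extra clamp of $\Theta$ at zero is an assumption added so that some mode always qualifies.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class ControllerState:
    tot_ve: float = 0.0   # running, gain-weighted integral of the error
    prev_ve: float = 0.0  # error at the previous sampling instant

def control_step(state, t_cpu, t_env, t_air, t_ref, gains, gamma_i,
                 period, kappa, p_act, p_inc, mode_capacities):
    """One sampling-period iteration of Algorithm 1 (Lines 3-7).

    mode_capacities[j] is the capacity required by mode M_j, sorted ascending.
    Returns the PWM width theta and the index of the selected mode.
    """
    ve = t_ref - (t_cpu + t_env + t_air)                              # Line 3
    state.tot_ve += gamma_i * kappa * period * (ve + state.prev_ve) / 2  # Line 4
    state.prev_ve = ve
    p_cpu = state.tot_ve - (gains[0] * t_cpu + gains[1] * t_env)      # Line 5
    theta = min(period * (p_cpu - p_inc) / (p_act - p_inc), period)   # Line 6
    theta = max(theta, 0.0)  # assumption: not part of Algorithm 1
    mode = bisect_right(mode_capacities, theta) - 1                   # Line 7
    return theta, mode
```

The binary search in `bisect_right` returns the number of capacities that do not exceed `theta`, so subtracting one yields the highest qualifying mode, matching the $O(\lg q)$ bound stated above.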

VI. THERMAL-RESILIENCY CALCULATION

In this section, we explain how to derive the real-time thermal resiliency $\Lambda(M_i, T_{ref})$ for a given real-time performance mode $M_i$ and reference temperature $T_{ref}$. Assuming a steady-state error of zero, we briefly outline how to obtain a solution for $\Lambda(M_i, T_{ref})$.$^3$ Assume that we have reached the steady state by the $(k-1)$'th sampling period. Therefore, $T_{cpu}(k) = T_{cpu}(k-1)$, $T_{env}(k) = T_{env}(k-1)$, $T_{air}(k) = T_{air}(k-1)$, and $\Theta(k) = \Theta(k-1)$. Substituting the temperature equalities into Equations (6) and (7) allows us to solve for $T_{cpu}(k)$ and $T_{env}(k)$ to obtain a function of $T_{air}(k)$, $T_{ref}$, and $\Theta(k)$. Since we are interested in obtaining $\Lambda(M_i, T_{ref})$, we may fix $T_{ref}$ and $\Theta(k) = \Theta^{(i)}$. Since the steady-state error is zero, we also have

$$T_{ref} = T_{cpu}(k) + T_{env}(k) + T_{air}(k). \tag{14}$$

Combining Equation (14) with the function of $T_{air}(k)$ obtained from $T_{cpu}(k)$ and $T_{env}(k)$ allows us to solve for $T_{air}(k)$. Thus, solving the entire system results in a value for $T_{env}(k) + T_{air}(k)$ (i.e., the value of $\Lambda(M_i, T_{ref})$). The resulting expression is quite complicated as it requires solutions to second-order inhomogeneous equations. Full details and the closed-form expression for $\Lambda(M_i, T_{ref})$ are provided in the extended paper [11].
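To make the outline above concrete, the following sketch computes a resiliency value under a deliberately simplified model. It is not the closed-form expression of [11]: it assumes the PWM signal can be replaced by its mean power from Equation (13) (with zero switching overhead), ignores any constant leakage term, and takes the steady state of Equation (10) by setting the derivatives to zero. The function name and the thermal parameter values used in the example are hypothetical.

```python
def steady_state_resiliency(theta, t_ref, p_act, p_inc, period,
                            beta1, beta2, sigma1, sigma2, k_t, p_env=0.0):
    """Averaged-model estimate of T_env + T_air for capacity theta.

    Solves 0 = A x + B P + [0, sigma2]^T P_env (steady state of Eq. (10))
    by Cramer's rule, then applies Eq. (14) to recover T_air.
    Returns (resiliency, t_cpu, t_env, t_air).
    """
    p_mean = (p_act - p_inc) / period * theta + p_inc  # Eq. (13), no overhead
    det = beta1 * beta2 - k_t ** 2 * sigma1 * sigma2
    t_cpu = (beta2 * sigma1 * p_mean
             + k_t * sigma1 * sigma2 * (p_mean + p_env)) / det
    t_env = (beta1 * sigma2 * (p_mean + p_env)
             + k_t * sigma1 * sigma2 * p_mean) / det
    t_air = t_ref - t_cpu - t_env                      # Eq. (14)
    return t_env + t_air, t_cpu, t_env, t_air

# Hypothetical thermal parameters; power and period values from Table I.
lam, t_cpu, t_env, t_air = steady_state_resiliency(
    theta=10.0, t_ref=70.0, p_act=73.0, p_inc=20.0, period=20.0,
    beta1=0.5, beta2=0.4, sigma1=0.05, sigma2=0.01, k_t=0.1)
```

In this simplified setting the result reduces to $T_{ref} - T_{cpu}$, so a mode that needs less capacity (and hence less mean power) tolerates a higher external temperature.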

VII. VALIDATION

In this section, we evaluate our control framework both in simulations and upon an experimental hardware testbed.

A. Simulations

In the simulations, we simulate the execution of a single-core processor which consists of a thermal controller, PWM frequency controller loop, and scheduling algorithm. The following task parameters are used in our simulations:

• Each sporadic task $\tau_j = (e_j, d_j, p_j)$ has a period $p_j$ uniformly drawn from the interval [5, 15]. (A small period range is used to keep the LCM of periods from becoming too large.) The execution time requirement $e_j$ is set to the task utilization times $p_j$, where task utilization is calculated using the UUniFast algorithm [31]. For each task, $d_j$ equals $p_j$. The tasks are scheduled by EDF.

• The total number of tasks is eight; each task $\tau_j$ has three different real-time performance modes where $\tau^{(2)}_j = (e_j, d_j, p_j)$; $\tau^{(1)}_j = (.2e_j, d_j, p_j)$; and $\tau^{(0)}_j$ means that the task is not selected. From the set of all possible combinations of tasks, we have selected fifteen modes with utilizations ranging from zero to one.

$^3$ The approach may be generalized when there is bounded steady-state error. However, the approach will be similar, and we omit the details due to space.

TABLE I: Testbed Parameters

Parameter                        Variable      Value
CPU Active Power                 $P_{act}$     73 W
CPU Idle Power                   $P_{inc}$     20 W
Server Period                    $\Pi$         20 ms
Sampling Time                    $T_s$         100 ms
Optimal Feedback                 $K_o$         [.5725  0]
Q matrix in Performance Index    $Q$           [1 0; 0 1]
R matrix in Performance Index    $R$           [1]
Integral Gain                    $\gamma_I$    0.00042
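The task-generation procedure described in the bullets above can be sketched as follows. The UUniFast routine follows the algorithm of [31]; everything else (function names, the fixed seed, and the choice of integer periods, which is assumed here because the text refers to the LCM of the periods) is a hypothetical illustration rather than the simulator actually used.

```python
import random

def uunifast(n, total_utilization, rng):
    """Draw n task utilizations that sum to total_utilization (UUniFast)."""
    utilizations = []
    remaining = total_utilization
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utilizations.append(remaining - next_remaining)
        remaining = next_remaining
    utilizations.append(remaining)
    return utilizations

def generate_tasks(n, total_utilization, seed=0):
    """Return n tasks (e_j, d_j, p_j) with d_j = p_j and p_j in [5, 15]."""
    rng = random.Random(seed)
    tasks = []
    for u in uunifast(n, total_utilization, rng):
        p = rng.randint(5, 15)          # assumption: integer periods
        tasks.append((u * p, p, p))
    return tasks

def task_modes(task):
    """The three modes of one task: not selected, 20% execution, full."""
    e, d, p = task
    return [None, (0.2 * e, d, p), (e, d, p)]

tasks = generate_tasks(8, 0.9)
```

A system-level performance mode is then one choice of per-task mode for each of the eight tasks; the fifteen modes used in the simulations are a hand-picked subset of these combinations.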

We refer to the controller described in Algorithm 1 as Temperature Regulated Capacity Bound (TRCB). In our simulations, we closely compare the performance of our proposed method with that of [8], referred to as Thermal Control Utilization Bound (TCUB). TCUB has been chosen due to its low controller time complexity of $O(1)$. TCUB works by attempting to track a reference temperature and adjusting system utilization as needed by changing task modes via a mode assignment heuristic. The major difference between TCUB and TRCB is that TCUB does not have predefined modes. Therefore, TCUB may differ in the assigned modes from run to run for the same system temperature. Furthermore, TCUB does not use multiple power levels. TRCB, on the other hand, has predefined modes which permit the derivation of thermal resiliency for each mode. TRCB also utilizes a low-power mode (if available).

In our simulation, we use the same system parameters as our testbed (Intel Pentium IV 3.0 GHz). The pertinent power and control parameters are given in Table I. Extensive testbed runs were carried out to generate the remaining system parameters using SI. We use the SI tools provided by Matlab to derive the system state-space parameters, and we use the system parameters generated from our testbed as the simulation parameters. We observe that our testbed readings match the simulation. More details on this process are contained in the extended version of the paper [11].

Figure 5 shows the system response and the utilization for both TRCB (right graphs) and TCUB (left graphs) given a stable air temperature $T_{air}$ equal to $5^\circ$C. The behavior of both controllers in this stable environment is nearly identical for thermal and utilization behavior. (The difference is due to the fact that TRCB uses EDF and TCUB uses RM scheduling.) For TRCB, we also display the achieved modes at any given time in the simulation in the lower right graph.

[Figure 5: four panels plotting temperature and utilization versus time for TRCB and TCUB. The temperature panels show the CPU, reference, and air temperatures; the TRCB utilization panel also shows the instantaneous mode.]

Fig. 5: Fixed $T_{air}$ for Simulation. Left plots represent TCUB and right plots represent TRCB.

[Figure 6: four panels plotting temperature and utilization versus time for TRCB and TCUB. The temperature panels show the CPU, reference, and air temperatures; the TRCB utilization panel also shows the instantaneous mode.]

Fig. 6: Dynamically Varying $T_{air}$ for Simulation. Left plots represent TCUB and right plots represent TRCB.

Figure 6 shows the behavior of both TRCB and TCUB when $T_{air}$ is dynamically changed over time. In the top two graphs of the figure, the absolute CPU temperatures over time obtained by TCUB and TRCB, respectively, are plotted along with $T_{air}$. The two bottom graphs of Figure 6 present the achieved utilization for each controller; additionally, the bottom right graph displays the active mode at any point in time for TRCB. Observe that both controllers are able to track the reference temperature $T_{ref}$ despite the sharp changes in $T_{air}$. For both controllers, the utilization appropriately tracks the changes in air temperature. When the air temperature increases, both controllers decrease the system utilization, and they increase the utilization again when the air temperature drops. Similarly, the mode plot in the lower right graph tracks the temperature changes.

Regarding the real-time performance, figures displaying deadline miss ratios have been omitted as no deadline miss was experienced for either controller in any of the simulations. TCUB uses a safe utilization bound of approximately 67% to make deadline misses improbable for rate-monotonic scheduling [26]. However, TRCB guarantees that no deadlines are ever missed due to verification using a multi-modal schedulability test [27] as described in Section IV-B.

Thus far, the empirical performance of TRCB and TCUB may appear similar. However, we believe the distinguishing feature of TRCB is the ability to guarantee hard deadlines and to calculate thermal resiliency levels during design time. Thermal resiliency calculation provides a non-destructive thermal stress analysis for real-time performance modes in an unpredictable operating environment. Our approach has achieved the ability to calculate the thermal resiliency by forcing the system to execute in a very predictable manner (i.e., periodic executions from PWM). To evaluate and illustrate our thermal resiliency calculation, we have used the technique in Section VI to calculate the thermal resiliency levels for our randomly-generated multi-mode system. Figure 7 displays the thermal resiliency $\Lambda(M_i, T_{ref})$ for a range of modes and reference temperatures. Observe that the thermal resiliency increases with decreasing modes or increasing $T_{ref}$.

[Figure 7: surface plot titled "Thermal Resiliency Function" showing $T_{env} + T_{air}$ ($^\circ$C) over mode and $T_{ref}$ ($^\circ$C).]

Fig. 7: Thermal resiliency over modes and $T_{ref}$.

[Figure 8: two panels titled "Testbed Run for $T_{ref} = 70\,^\circ$C" and "Testbed Run for $T_{ref} = 78\,^\circ$C", each plotting $\Theta$, mode, $T_{air}$, $T_{air} + T_{env}$, and $T_{air} + T_{env} + T_{cpu}$ versus time.]

Fig. 8: The testbed running at different $T_{ref}$ values showing the $\Theta$ and Mode change over the time

B. Experiments upon Hardware Testbed

To further confirm the validity of the theoretical results, we have run a task system with eight tasks, each with three modes (identical to the simulation setting), on our hardware testbed. Each task performs numerical calculations while executing on the system. Our hardware testbed behaves similarly to the simulations of the previous subsection. Figure 8 presents testbed runs for a fixed air and environment temperature. Figure 9 shows how the testbed behaves when an outside heat source is dynamically introduced into the environment. Observe that there is a momentary drop in performance mode; however, the system soon stabilizes.

Finally, we validate our thermal resiliency calculation. Unfortunately, we do not have test equipment to accurately vary the air or environment temperature. Thus, we consider the air temperature to be fixed at the room temperature (in this case $T_{air} = 24.8^\circ$C). Instead, we indirectly analyze the thermal resiliency function via the inverse of the thermal resiliency function $\Lambda^{-1}(M_i, T_{air}) = \min\{T_{ref} \mid T_{air} \le \Lambda(M_i, T_{ref})\}$. Intuitively, a lower value of $\Lambda^{-1}(M_i, T_{air})$ means the system can operate at a lower temperature and thus is more resilient than a higher value of the function. We have calculated this function for four different runs of the hardware testbed (to ensure that minor fluctuations of the air temperature do not affect the system). Figure 10 shows a plot of the thermal resiliency of the testbed runs when $T_{ref}$ is changed. The figure shows that the calculated inverse resiliency of the system increases with increasing operating mode. Most importantly, the calculated thermal resiliency tracks the actual behavior of the testbed and provides a safe upper bound on $T_{ref}$ in a large majority of the cases, which validates the effectiveness of the resiliency function.

[Figure 9: four panels, titled "Testbed Run for $T_{ref} = 78\,^\circ$C for Varying Conditions" and "Testbed Run for $T_{ref} = 61\,^\circ$C for Varying Conditions", each plotting $\Theta$, mode, $T_{air}$, $T_{air} + T_{env}$, and $T_{air} + T_{env} + T_{cpu}$ versus time.]

Fig. 9: The testbed running at varying environmental conditions showing the $\Theta$ and Mode change over the time

[Figure 10: plot titled "Inverse Thermal Resiliency Function for Fixed $T_{air} = 24.8\,^\circ$C" showing $T_{ref} = \Lambda^{-1}(M_i, T_{air})$ versus mode for the calculated thermal resiliency function and for Testbed Runs #1 through #4.]

Fig. 10: Thermal Resiliency for the Simulation.
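The inverse resiliency function defined above can be evaluated numerically by bisection, as in the following sketch. It assumes that $\Lambda(M_i, T_{ref})$ is nondecreasing in $T_{ref}$ (consistent with the trend reported for Figure 7) and that the caller supplies it as a Python callable; the function name, the search bounds, and the linear stand-in used in the example are hypothetical.

```python
def inverse_resiliency(resiliency, t_air, lo, hi, tol=1e-6):
    """Smallest T_ref in [lo, hi] such that t_air <= resiliency(T_ref).

    Assumes resiliency is nondecreasing in T_ref.  Returns None when even
    the upper bound hi cannot withstand t_air.
    """
    if resiliency(hi) < t_air:
        return None
    if resiliency(lo) >= t_air:
        return lo
    # Invariant: resiliency(lo) < t_air <= resiliency(hi).
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if resiliency(mid) >= t_air:
            hi = mid
        else:
            lo = mid
    return hi

# Stand-in resiliency function for one mode (an assumption for the example).
example = inverse_resiliency(lambda t_ref: t_ref - 40.0, t_air=24.8,
                             lo=30.0, hi=90.0)
```

For the linear stand-in, the smallest admissible reference temperature is the one at which the stand-in equals the air temperature of 24.8 $^\circ$C.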

VIII. CONCLUSIONS

In this paper, we have addressed the problem of obtaining performance guarantees in an unpredictable thermal environment. Towards this challenge, we have presented a control-theoretic framework for thermal stress analysis in real-time systems. Our proposed method employs a nested feedback control system, which is based on optimal control theory. For our system, we derive strong thermal-resiliency and hard-real-time guarantees for any real-time performance mode. Our method has the distinct advantage of being able to verify the real-time thermal resiliency of a system before it is put into operation. In addition, we show via simulations that our framework performs as well as previous approaches which have no formal guarantee on the thermal resiliency. Our implementation upon a hardware testbed validates our proposed model and control framework.

In future work, we plan to extend our framework to control designs that are more robust to model inaccuracies (e.g., $H_\infty$ or model-predictive controllers). As an initial step in designing a framework for thermal stress analysis, our current design uses two RC circuits (for dynamic and leakage currents) to model the CPU temperature. We plan on extending our model to permit multiple RC circuits for heterogeneous thermal distributions and generalizing our thermal equations for more complex RC circuit layouts. We hope to derive a general-theoretic design framework that captures "resiliency" metrics for other system properties (e.g., energy, noise, etc.) and extend our analysis to other hardware settings (e.g., multicore, DVS).

ACKNOWLEDGMENTS

This research has been supported in part by the NSF (Grant Nos. CNS-0953585, CNS-1116787, and CNS-1136007), the Air Force Office of Scientific Research (Grant No. FA9550-10-1-0210), and two grants from Wayne State University's Office of Vice President of Research.

REFERENCES

[1] J. Sergent and A. Krum, Thermal Management Handbook for Electronic Assemblies. McGraw-Hill Professional, 1998.

[2] S. Kim, P. Tathireddy, R. Normann, and F. Solzbacher, "Thermal impact of an active 3-d microelectrode array implanted in the brain," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 15, no. 4, pp. 493–501, December 2007.

[3] G. Lazzi, "Thermal effects of bioimplants," IEEE Engineering in Medicine and Biology Magazine, vol. 24, no. 5, pp. 75–81, September-October 2005.

[4] J. C. LaManna, K. A. McCracken, M. Patil, and O. J. Prohaska, "Stimulus-activated changes in brain tissue temperature in the anesthetized rat," Metabolic Brain Disease, vol. 4, no. 4, pp. 225–237, 1989.

[5] P. Ruggera, D. Witters, G. von Maltzahn, and H. Bassen, "In vitro assessment of tissue heating near metallic medical implants by exposure to pulsed radio frequency diathermy," Physics in Medicine and Biology, vol. 48, no. 17, pp. 2919–2928, 2003.

[6] G. Kelly, "Body temperature variability (part 1): a review of the history of body temperature and its variability due to site selection, biological rhythms, fitness, and aging," Alternative Medicine Review, vol. 11, no. 4, pp. 278–293, 2006.

[7] N. Timmons and W. Scanlon, "An adaptive energy efficient mac protocol for the medical body area network," in 1st International Conference on Wireless Communication, Vehicular Technology, Information Theory and Aerospace Electronic Systems Technology, May 2009, pp. 587–593.

[8] Y. Fu, N. Kottenstette, Y. Chen, C. Lu, X. D. Koutsoukos, and H. Wang, "Feedback thermal control for real-time system," in Proceedings of the Real-Time and Embedded Technology and Applications Systems Symposium. Stockholm, Sweden: IEEE Computer Society Press, April 2010.

[9] X. Fu, X. Wang, and E. Puster, "Simultaneous thermal and timeliness guarantees in distributed real-time embedded systems," Journal of Systems Architecture, 2010, to appear.

[10] R. Rajkumar, C. Lee, J. Lehoczky, and D. Siewiorek, "A resource allocation model for qos management," in Proceedings of the 18th IEEE Real-Time Systems Symposium, ser. RTSS '97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 298–. [Online]. Available: http://portal.acm.org/citation.cfm?id=827269.828990

[11] P. M. Hettiarachchi, N. Fisher, M. Ahmed, L. Y. Wang, S. Wang, and W. Shi, "The design and analysis of thermally-resilient hard-real-time systems (extended version)," Wayne State University, Tech. Rep., 2011, available at http://www.cs.wayne.edu/~fishern/papers/thermal-control-rtas2012.pdf.

[12] A. K. Mok, "Fundamental design problems of distributed systems for the hard-real-time environment," Ph.D. dissertation, Laboratory for Computer Science, Massachusetts Institute of Technology, 1983, available as Technical Report No. MIT/LCS/TR-297.

[13] D. Brooks and M. Martonosi, "Dynamic thermal management for high-performance microprocessors," in International Symposium on High-Performance Computer Architecture, 2001.

[14] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware microarchitecture," in International Symposium on Computer Architecture, 2003.

[15] N. Bansal and K. Pruhs, "Speed scaling to manage temperature," in Symposium on Theoretical Aspects of Computer Science, 2005.

[16] S. Wang and R. Bettati, "Reactive speed control in temperature-constrained real-time systems," Real-Time Systems Journal, vol. 39, no. 1-3, pp. 658–671, 2008.

[17] J.-J. Chen, S. Wang, and L. Thiele, "Proactive speed scheduling for frame-based real-time tasks under thermal constraints," in IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2009.

[18] G. Quan and Y. Zhang, "Leakage Aware Feasibility Analysis for Temperature-Constrained Hard Real-Time Periodic Tasks," in Proceedings of the 2009 21st Euromicro Conference on Real-Time Systems, Volume 00. IEEE Computer Society, 2009, pp. 207–216.

[19] J.-J. Chen, C.-M. Hung, and T.-W. Kuo, "On the minimization of the instantaneous temperature for periodic real-time tasks," in IEEE Real-Time and Embedded Technology and Applications Symposium, 2007.

[20] T. Chantem, R. P. Dick, and X. S. Hu, "Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs," in Design, Automation and Test in Europe, 2008.

[21] N. Fisher, J.-J. Chen, S. Wang, and L. Thiele, "Thermal-aware global real-time scheduling on multicore systems," in Proceedings of the 15th IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE Computer Society Press, April 2009.

[22] A. Ferreira, D. Mosse, and J. Oh, "Thermal faults modeling using a rc model with an application to web farms," in Proceedings of the Euromicro Conference on Real-Time Systems. IEEE Computer Society, July 2007.

[23] M. Ahmed, N. Fisher, S. Wang, and P. Hettiarachchi, "Minimizing peak temperature in embedded real-time systems via thermal-aware periodic resources," Sustainable Computing: Informatics and Systems, vol. 1, no. 3, pp. 226–240, 2011.

[24] I. Shin and I. Lee, "Compositional real-time scheduling framework with periodic model," ACM Transactions on Embedded Computing Systems, vol. 7, no. 3, April 2008.

[25] Intel Pentium 4 processor in the 423-pin package thermal design guidelines. Intel Corp., 2000.

[26] C. Liu and J. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," Journal of the ACM, vol. 20, no. 1, pp. 46–61, 1973.

[27] N. Fisher and M. Ahmed, "Tractable real-time schedulability analysis for mode changes under temporal isolation," in Proceedings of the 9th IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTImedia). IEEE Computer Society, October 2011.

[28] Y. Liu, R. P. Dick, L. Shang, and H. Yang, "Accurate temperature-dependent integrated circuit leakage power estimation is easy," in Proceedings of the Conference on Design, Automation and Test in Europe, Nice, France, 2007, pp. 1526–1531.

[29] K. Ogata, Discrete-Time Control Systems (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1995.

[30] A. K. Gelig and A. N. Churilov, Stability and Oscillations of Nonlinear Pulse-Modulated Systems. Boston: Birkhauser, 1998.

[31] E. Bini and G. Buttazzo, "Biasing effects in schedulability measures," in Proceedings of the 16th Euromicro Conference on Real-Time Systems. IEEE Computer Society, 2004, pp. 196–203.

[32] R. C. Dorf and R. H. Bishop, Modern Control Systems. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2000.

[33] N. S. Nise, Control Systems Engineering. New York, NY, USA: John Wiley & Sons, Inc., 2000.

APPENDIX

A. Stability Analysis and Optimal State Feedback

In our derivation of stability for our system, we will use the following two results which can be found in any standard text on control theory [29], [32], [33].

Lemma 1 (from [32]): The system of Equation (8) is completely controllable if there exists an unconstrained $u(t)$ that can transfer any initial state $x(t_0)$ to any desired final state $x_f$ in a finite time, $t_0 \le t \le T$. Complete controllability can be determined by examining the algebraic condition

$$\mathrm{rank}\,[\,B \;\; AB \;\; A^2B \;\; \ldots \;\; A^{m-1}B\,] = m, \tag{15}$$

where $A$ is an $m \times m$ matrix and $B$ is an $m \times r$ matrix.

Lemma 2 (from [29]): A discrete-time linear time-invariant (LTI) system is asymptotically stable if and only if all eigenvalues of $G$ lie inside the unit circle.
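For the two-state model of Equation (10) with a single input ($m = 2$, $r = 1$), both lemmas reduce to short closed-form checks, sketched below. The rank condition of Equation (15) becomes a nonzero determinant of $[B \;\; AB]$, and the eigenvalues of a $2 \times 2$ matrix follow from its trace and determinant. The function names, tolerance, and numeric matrices are hypothetical.

```python
import cmath

def is_controllable_2x2(a, b, tol=1e-12):
    """Eq. (15) for m = 2, r = 1: rank [B AB] = 2 iff det [B AB] != 0."""
    ab = [a[0][0] * b[0] + a[0][1] * b[1],
          a[1][0] * b[0] + a[1][1] * b[1]]
    det = b[0] * ab[1] - b[1] * ab[0]
    return abs(det) > tol

def eigenvalues_2x2(g):
    """Roots of the characteristic polynomial s^2 - trace*s + det."""
    trace = g[0][0] + g[1][1]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def is_asymptotically_stable(g):
    """Lemma 2: every eigenvalue lies strictly inside the unit circle."""
    return all(abs(lam) < 1 for lam in eigenvalues_2x2(g))

# Hypothetical matrices with the structure of Eq. (10).
A = [[-0.5, 0.005], [0.001, -0.4]]
B = [0.05, 0.01]
controllable = is_controllable_2x2(A, B)
```

Note that Lemma 2 applies to the discrete-time matrix $G$ (or to the closed-loop matrix once feedback is applied), not to the continuous-time $A$.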

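Now we derive the augmented system model that is used to obtain the optimality of the system. Equation (10) can be used to describe the system dynamics at any time instance. Consider an instance where the system is completely stable and has attained the steady state. We denote the system input, system states, and the servo error (described in Equation (11)) of this special instance of the system by $P^d_{cpu}(t_\infty)$, $T_{cpu}(t_\infty)$, $T_{env}(t_\infty)$, and $v_e(t_\infty)$, respectively. Therefore, we get,

$$\begin{bmatrix} \dot{T}_{cpu}(t_\infty) \\ \dot{T}_{env}(t_\infty) \end{bmatrix} = \begin{bmatrix} -\beta_1 & k_T\sigma_1 \\ k_T\sigma_2 & -\beta_2 \end{bmatrix} \begin{bmatrix} T_{cpu}(t_\infty) \\ T_{env}(t_\infty) \end{bmatrix} + \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix} P^d_{cpu}(t_\infty) + \begin{bmatrix} 0 \\ \sigma_2 \end{bmatrix} P_{env}(t_\infty). \tag{16}$$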

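Then, from Equations (10) and (16) we get,

$$\frac{d}{dt} \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \end{bmatrix} = \begin{bmatrix} -\beta_1 & k_T\sigma_1 \\ k_T\sigma_2 & -\beta_2 \end{bmatrix} \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \end{bmatrix} + \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix} \big( P^d_{cpu}(t) - P^d_{cpu}(t_\infty) \big). \tag{17}$$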

Also, from Equation (11), we get,

$$\dot{v}_e(t) - \dot{v}_e(t_\infty) = -C \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \end{bmatrix}. \tag{18}$$

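Now, combining Equations (17) and (18), we define our higher-order system as follows,

$$\frac{d}{dt} \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \\ v_e(t) - v_e(t_\infty) \end{bmatrix} = \begin{bmatrix} A & 0 \\ -C & 0 \end{bmatrix} \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \\ v_e(t) - v_e(t_\infty) \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} \big( P^d_{cpu}(t) - P^d_{cpu}(t_\infty) \big). \tag{19}$$

Define $e(t)$ and the augmented matrices $\hat{A}$ and $\hat{B}$ as

$$e(t) = \begin{bmatrix} T_{cpu}(t) - T_{cpu}(t_\infty) \\ T_{env}(t) - T_{env}(t_\infty) \\ v_e(t) - v_e(t_\infty) \end{bmatrix}, \tag{20}$$

$$\hat{A} = \begin{bmatrix} A & 0 \\ -C & 0 \end{bmatrix}, \tag{21}$$

and

$$\hat{B} = \begin{bmatrix} B \\ 0 \end{bmatrix}, \tag{22}$$

then we get,

$$\dot{e}(t) = \hat{A} e(t) + \hat{B} u_e(t). \tag{23}$$

We select the feedback gain $K$ such that,

$$u_e(t) = P^d_{cpu}(t) - P^d_{cpu}(t_\infty) \tag{24}$$

$$\phantom{u_e(t)} = -K e(t), \tag{25}$$

where,

$$K = \begin{bmatrix} K_o \\ -\gamma_I \end{bmatrix}^T. \tag{26}$$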

The above state-space and control gain parameters are valid for a continuous-time controller. We may therefore obtain the discrete-time state-space matrices for the augmented model (i.e., $\hat{G}$ and $\hat{H}$) from the augmented matrices $\hat{A}$ and $\hat{B}$ of Equations (21) and (22) via the transformation described after Equation (9). In LQ optimal control, the objective is to design the controller to minimize some performance index. A standard LQ performance index is given by

$$J \stackrel{def}{=} \frac{1}{2} \sum_{k=0}^{\infty} \left( e(k)^T Q\, e(k) + u_e^T(k)\, R\, u_e(k) \right), \tag{27}$$

where $Q$ and $R$ are arbitrary symmetric matrices of size $m \times m$ and $r \times r$ such that $Q \ge 0$ (positive semidefinite) and $R > 0$ (positive definite). (In our system given in Equation (10), $m$ is two and $r$ is one.) For a linear time-invariant (LTI) system, the optimal state feedback can be obtained as (refer to Ogata [29])

$$u_e(k) = -K e(k), \tag{28}$$

where $K$ is the feedback gain defined as

$$K = (R + \hat{H}^T P \hat{H})^{-1} \hat{H}^T P \hat{G}, \tag{29}$$

and where $P$ is the positive definite solution of the algebraic Riccati equation below,

$$P = Q + \hat{G}^T P \hat{G} - \hat{G}^T P \hat{H} (R + \hat{H}^T P \hat{H})^{-1} \hat{H}^T P \hat{G}.$$

From the above, it may be shown [29] that the optimal performance index can be calculated as

$$J_{min} = \frac{1}{2}\, e^T(0)\, P\, e(0). \tag{30}$$

It is well known [29] that the feedback control (i.e., $K$) results in an asymptotically stable closed-loop system according to Lemma 2. Stable choices of $K_o$ and $\gamma_I$ for the original (non-augmented) system can be obtained immediately from the derived $K$.
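One simple way to obtain $P$ and $K$ numerically is to iterate the Riccati equation as a recursion until it settles, as sketched below for the single-input case (so that $R + H^T P H$ is a scalar and no matrix inversion is needed). This is an illustrative sketch under the assumption that the recursion converges, which holds for suitably controllable and observable problems; the function names and the fixed iteration count are hypothetical, and this is not necessarily how the gains in Table I were computed.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def lq_gain(g, h, q, r, iterations=500):
    """Iterate the discrete Riccati recursion; return (K, P).

    g: n x n, h: n x 1 (single input), q: n x n, r: positive scalar.
    """
    n = len(g)
    gt, ht = transpose(g), transpose(h)

    def gain(p):
        # Eq. (29): K = (R + H^T P H)^-1 H^T P G, a 1 x n row vector here.
        s = r + mat_mul(ht, mat_mul(p, h))[0][0]
        return [[x / s for x in mat_mul(ht, mat_mul(p, g))[0]]]

    p = [row[:] for row in q]
    for _ in range(iterations):
        gtpg = mat_mul(gt, mat_mul(p, g))
        gtph = mat_mul(gt, mat_mul(p, h))   # n x 1
        corr = mat_mul(gtph, gain(p))       # G^T P H (R + H^T P H)^-1 H^T P G
        p = [[q[i][j] + gtpg[i][j] - corr[i][j] for j in range(n)]
             for i in range(n)]
    return gain(p), p

# Scalar sanity example: G = H = Q = R = 1.
K, P = lq_gain([[1.0]], [[1.0]], [[1.0]], 1.0)
```

For the scalar example the Riccati equation reduces to $P^2 - P - 1 = 0$, so $P$ converges to the golden ratio and $K = P / (1 + P)$.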

B. The Temperature Calculations

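When we consider our thermal model with the leakage current effect, the CPU temperature is calculated from the solution of a second-order differential equation. From Equation (3), we get $T_{env}(t)$ and its first derivative as follows,

$$T_{env}(t) = \frac{1}{k_T\sigma_1} \left( \frac{d}{dt} T_{cpu}(t) + \beta_1 T_{cpu}(t) - \sigma_1 P^d_{cpu}(t) - \sigma_1 k_C \right), \tag{31}$$

$$\begin{aligned} \frac{d}{dt} T_{env}(t) &= \frac{1}{k_T\sigma_1} \left( \frac{d^2}{dt^2} T_{cpu}(t) + \beta_1 \frac{d}{dt} T_{cpu}(t) - \sigma_1 \frac{d}{dt} P^d_{cpu}(t) \right) \\ &= \frac{1}{k_T\sigma_1} \left( \frac{d^2}{dt^2} T_{cpu}(t) + \beta_1 \frac{d}{dt} T_{cpu}(t) \right). \end{aligned} \tag{32}$$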

In this analysis, we consider a system that can be described by the model shown in Section IV. In Equation (32), we therefore consider the system behavior over discrete time intervals and take the input to be constant within each sampling interval (the input value at a sampling instant is held until the next sampling instant). This assumption is realistic because we implement our system as a discrete-time control system, in which the zero-order hold (ZOH) holds the input value between sampling instants. Consider any such interval in which the input is held constant; for any time instant t in this interval, the time derivative of Pdcpu(t) is zero. Thus, we can substitute Equations (31) and (32) into Equation (5) to get the following,

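$$\frac{d^2}{dt^2}T_{cpu}(t) + V\frac{d}{dt}T_{cpu}(t) + B\,T_{cpu}(t) = F_{act/inc/cont}, \tag{33}$$

where,

$$\begin{aligned}
V &\stackrel{def}{=} (\beta_1 + \beta_2),\\
B &\stackrel{def}{=} (\beta_1\beta_2 - k_T^2\sigma_1\sigma_2),\\
F_{act} &\stackrel{def}{=} (\beta_2\sigma_1 + \sigma_1\sigma_2 k_T)(P_{act} + k_C) + \sigma_1\sigma_2 k_T P_{env}(t),\\
F_{inc} &\stackrel{def}{=} (\beta_2\sigma_1 + \sigma_1\sigma_2 k_T)(P_{inc} + k_C) + \sigma_1\sigma_2 k_T P_{env}(t),\\
F_{cont} &\stackrel{def}{=} (\beta_2\sigma_1 + \sigma_1\sigma_2 k_T)\left((P_{act} - P_{inc})\frac{\Theta}{\Pi} + P_{inc} + k_C\right) + \sigma_1\sigma_2 k_T P_{env}(t).
\end{aligned} \tag{34}$$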

Equation (33) is a second-order inhomogeneous equation in which F is a constant (Pdcpu(t) and Penv(t) are unchanged over two sampling periods). As already discussed, the CPU can operate in two power modes. Depending on the operating mode of the system (active or inactive CPU operation), we can derive two different F values. Also, when the CPU power is represented in terms of the resource capacity Θ, the corresponding F is denoted by Fcont (see footnote 4). Therefore, the complete solution for Tcpu over any continuous interval is given by,

$$\begin{aligned}
T^{act}_{cpu}(t) &= C_{1act}e^{r_1 t} + C_{2act}e^{r_2 t} + C_{3act},\\
T^{inc}_{cpu}(t) &= C_{1inc}e^{r_1 t} + C_{2inc}e^{r_2 t} + C_{3inc},
\end{aligned} \tag{35}$$

where,

$$\begin{aligned}
r_1 &= -\frac{1}{2}\left(V - \sqrt{V^2 - 4B}\right), &
r_2 &= -\frac{1}{2}\left(V + \sqrt{V^2 - 4B}\right),\\
C_{3act} &= \frac{F_{act}}{B}, &
C_{3inc} &= \frac{F_{inc}}{B}, \qquad C_{3cont} = \frac{F_{cont}}{B}.
\end{aligned}$$

We can thus derive three C values, depending on the corresponding F. In Equation (35), r1 and r2 are negative because √(V² − 4B) is positive and less than V (the latter holds whenever B > 0, i.e., β1β2 > kT²σ1σ2).
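The sign claim for r1 and r2 can be checked numerically. The sketch below evaluates V, B, and the two roots from the definitions in Equation (34); the parameter values and the function name are hypothetical and are not taken from the testbed.

```python
import math

def characteristic_roots(beta1, beta2, k_t, sigma1, sigma2):
    """Return (V, B, r1, r2) for the second-order model of Equation (33)."""
    V = beta1 + beta2
    B = beta1 * beta2 - k_t ** 2 * sigma1 * sigma2
    # V^2 - 4B = (beta1 - beta2)^2 + 4 k_T^2 sigma1 sigma2, so the root is always real.
    disc = math.sqrt(V * V - 4.0 * B)
    r1 = -0.5 * (V - disc)
    r2 = -0.5 * (V + disc)
    return V, B, r1, r2

# Hypothetical thermal constants with B > 0: both roots are negative.
V, B, r1, r2 = characteristic_roots(0.5, 0.05, 0.1, 0.2, 0.02)

# Hypothetical constants with B < 0 (strong coupling k_T): r1 becomes positive.
V_u, B_u, r1_u, r2_u = characteristic_roots(0.5, 0.05, 3.0, 0.2, 0.02)
```

Because r1·r2 = B and r1 + r2 = −V, both roots are negative exactly when B > 0 (given V > 0); the second call shows a parameter set for which this condition fails.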

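From Equations (31) and (35), we can find Tenv(t) for active and inactive CPU operation as follows,

$$\begin{aligned}
T^{act}_{env}(t) &= \frac{1}{k_T\sigma_1}\left(C_{1act}r_1e^{r_1 t} + C_{2act}r_2e^{r_2 t} - \sigma_1(P_{act} + k_C)\right) + \frac{\beta_1}{k_T\sigma_1}\left(C_{1act}e^{r_1 t} + C_{2act}e^{r_2 t} + C_{3act}\right),\\
T^{inc}_{env}(t) &= \frac{1}{k_T\sigma_1}\left(C_{1inc}r_1e^{r_1 t} + C_{2inc}r_2e^{r_2 t} - \sigma_1(P_{inc} + k_C)\right) + \frac{\beta_1}{k_T\sigma_1}\left(C_{1inc}e^{r_1 t} + C_{2inc}e^{r_2 t} + C_{3inc}\right).
\end{aligned} \tag{36}$$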

Footnote 4: We justify the need for Fcont in Section E.

We consider a system that alternates between active and inactive power modes over intervals of a given size. Assuming that the CPU and environment temperatures at the beginning of each interval, Tcpu(tb) and Tenv(tb), are known, where tb is the time at which the interval begins, and that Pdcpu and Penv are fixed throughout the interval, we may obtain C1 and C2 by solving Equations (35) and (36). We then obtain the expanded form of the constants as follows,

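$$\begin{aligned}
C_{1act}(t_b) &= \frac{1}{r_1 - r_2}\Big(r_2C_{3act} + \sigma_1\big(P_{act} + k_C + k_T T_{env}(t_b)\big) - (\beta_1 + r_2)T_{cpu}(t_b)\Big),\\
C_{2act}(t_b) &= \frac{1}{r_2 - r_1}\Big(r_1C_{3act} + \sigma_1\big(P_{act} + k_C + k_T T_{env}(t_b)\big) - (\beta_1 + r_1)T_{cpu}(t_b)\Big),\\
C_{1inc}(t_b) &= \frac{1}{r_1 - r_2}\Big(r_2C_{3inc}(t_b) + \sigma_1\big(P_{inc} + k_C + k_T T_{env}(t_b)\big) - (\beta_1 + r_2)T_{cpu}(t_b)\Big),\\
C_{2inc}(t_b) &= \frac{1}{r_2 - r_1}\Big(r_1C_{3inc} + \sigma_1\big(P_{inc} + k_C + k_T T_{env}(t_b)\big) - (\beta_1 + r_1)T_{cpu}(t_b)\Big),\\
C_{3act}(t_b) &= \frac{(\beta_2\sigma_1 + \sigma_1\sigma_2k_T)(P_{act} + k_C) + \sigma_1\sigma_2k_T P_{env}(t_b)}{\beta_1\beta_2 - k_T^2\sigma_1\sigma_2},\\
C_{3inc}(t_b) &= \frac{(\beta_2\sigma_1 + \sigma_1\sigma_2k_T)(P_{inc} + k_C) + \sigma_1\sigma_2k_T P_{env}(t_b)}{\beta_1\beta_2 - k_T^2\sigma_1\sigma_2},\\
C_{1cont}(t_b) &= \frac{1}{r_1 - r_2}\Big(r_2C_{3cont} + \sigma_1\Big(\frac{(P_{act} - P_{inc})\Theta + \Pi P_{inc}}{\Pi} + k_C + k_T T_{env}(t_b)\Big) - (\beta_1 + r_2)T_{cpu}(t_b)\Big),\\
C_{2cont}(t_b) &= \frac{1}{r_2 - r_1}\Big(r_1C_{3cont} + \sigma_1\Big(\frac{(P_{act} - P_{inc})\Theta + \Pi P_{inc}}{\Pi} + k_C + k_T T_{env}(t_b)\Big) - (\beta_1 + r_1)T_{cpu}(t_b)\Big),\\
C_{3cont}(t_b) &= \frac{(\beta_2\sigma_1 + \sigma_1\sigma_2k_T)\Big(\frac{(P_{act} - P_{inc})\Theta + \Pi P_{inc}}{\Pi} + k_C\Big) + \sigma_1\sigma_2k_T P_{env}(t_b)}{\beta_1\beta_2 - k_T^2\sigma_1\sigma_2}.
\end{aligned} \tag{37}$$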
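As a sanity check on the constants in Equation (37), the sketch below computes C1, C2, and C3 for one constant-power interval and evaluates Tcpu(t) and Tenv(t) from Equations (35) and (36), with t measured from the start of the interval. The parameter dictionary, its values, and the function names are hypothetical and chosen only for illustration.

```python
import math

def solution_constants(T_cpu0, T_env0, P, P_env, p):
    """Return (r1, r2, C1, C2, C3) for an interval with constant CPU power P."""
    b1, b2, kT, kC, s1, s2 = (p[k] for k in
                              ("beta1", "beta2", "k_T", "k_C", "sigma1", "sigma2"))
    V = b1 + b2
    B = b1 * b2 - kT ** 2 * s1 * s2
    root = math.sqrt(V * V - 4.0 * B)
    r1, r2 = -0.5 * (V - root), -0.5 * (V + root)
    C3 = ((b2 * s1 + s1 * s2 * kT) * (P + kC) + s1 * s2 * kT * P_env) / B
    drive = s1 * (P + kC + kT * T_env0)
    C1 = (r2 * C3 + drive - (b1 + r2) * T_cpu0) / (r1 - r2)
    C2 = (r1 * C3 + drive - (b1 + r1) * T_cpu0) / (r2 - r1)
    return r1, r2, C1, C2, C3

def t_cpu(t, consts):
    """CPU temperature, Equation (35)."""
    r1, r2, C1, C2, C3 = consts
    return C1 * math.exp(r1 * t) + C2 * math.exp(r2 * t) + C3

def t_env(t, consts, P, p):
    """Environment temperature, Equation (36)."""
    r1, r2, C1, C2, C3 = consts
    e1, e2 = math.exp(r1 * t), math.exp(r2 * t)
    d_tcpu = C1 * r1 * e1 + C2 * r2 * e2
    tcpu = C1 * e1 + C2 * e2 + C3
    return (d_tcpu - p["sigma1"] * (P + p["k_C"]) + p["beta1"] * tcpu) / (p["k_T"] * p["sigma1"])

# Hypothetical thermal constants and boundary temperatures.
params = {"beta1": 0.5, "beta2": 0.05, "k_T": 0.1, "k_C": 0.5,
          "sigma1": 0.2, "sigma2": 0.02}
consts = solution_constants(T_cpu0=40.0, T_env0=30.0, P=10.0, P_env=1.0, p=params)
```

With these definitions, evaluating both functions at t = 0 returns the boundary temperatures that were supplied, which is the consistency property the constants are constructed to satisfy.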

Note that here we replace the initial power setting Pdcpu(t) with Pact.

We now consider adjacent, interleaved heating and cooling periods and take the initial temperature of each period to be the final temperature of the previous one. We use Equations (35) and (36) to derive the temperature of the system at the end of each period. Consider first the CPU temperature over any active period (nΠ, nΠ+Θ],

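$$T^{act}_{cpu}(n\Pi+\Theta) = C_{1act}(n\Pi)e^{r_1\Theta} + C_{2act}(n\Pi)e^{r_2\Theta} + C_{3act}(n\Pi), \tag{38}$$

and the temperature at the boundary,

$$T^{act}_{cpu}(n\Pi) = C_{1act}(n\Pi) + C_{2act}(n\Pi) + C_{3act}(n\Pi). \tag{39}$$

From Equations (38) and (39) we get,

$$T^{act}_{cpu}(n\Pi+\Theta) = T^{act}_{cpu}(n\Pi) + C_{1act}(n\Pi)(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)(e^{r_2\Theta} - 1). \tag{40}$$

Similarly, for the inactive CPU period, we can derive the following equations,

$$T^{inc}_{cpu}((n+1)\Pi) = C_{1inc}(n\Pi+\Theta)e^{r_1(\Pi-\Theta)} + C_{2inc}(n\Pi+\Theta)e^{r_2(\Pi-\Theta)} + C_{3inc}(n\Pi+\Theta), \tag{41}$$

$$T^{inc}_{cpu}(n\Pi+\Theta) = C_{1inc}(n\Pi+\Theta) + C_{2inc}(n\Pi+\Theta) + C_{3inc}(n\Pi+\Theta). \tag{42}$$

From Equations (40), (41), and (42) we get,

$$\begin{aligned}
T^{inc}_{cpu}((n+1)\Pi) = T^{act}_{cpu}(n\Pi) &+ C_{1inc}(n\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + C_{2inc}(n\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1)\\
&+ C_{1act}(n\Pi)(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)(e^{r_2\Theta} - 1).
\end{aligned} \tag{43}$$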

Therefore, we can derive the equation for the period (nΠ, (n+ς)Π] as follows,

$$\begin{aligned}
T^{inc}_{cpu}((n+\varsigma)\Pi) = T^{act}_{cpu}(n\Pi) &+ \sum_{i=0}^{\varsigma-1} C_{1inc}((n+i)\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + \sum_{i=0}^{\varsigma-1} C_{2inc}((n+i)\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1)\\
&+ \sum_{i=0}^{\varsigma-1} C_{1act}((n+i)\Pi)(e^{r_1\Theta} - 1) + \sum_{i=0}^{\varsigma-1} C_{2act}((n+i)\Pi)(e^{r_2\Theta} - 1)\\
= T^{act}_{cpu}(n\Pi) &+ \sum_{i=0}^{\varsigma-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((n+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1) + C_{(j)act}((n+i)\Pi)(e^{r_{(j)}\Theta} - 1)\Big).
\end{aligned} \tag{44}$$

Note that Equation (44) is inductively defined, as the constants for the boundary conditions can be derived from Equation (37), which is expressed in terms of the previous values of Tcpu and Tenv.
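The inductive structure of Equation (44) can be sketched as a loop that alternates active and inactive phases, recomputing the constants of Equation (37) at every boundary and carrying the end temperatures of one phase into the next. This is a minimal illustration under assumed parameter values; the function names and the parameter tuple are hypothetical.

```python
import math

def step(T_cpu0, T_env0, P, P_env, dt, p):
    """Advance (T_cpu, T_env) by dt at constant CPU power P using Equations (31), (35), (37)."""
    b1, b2, kT, kC, s1, s2 = p
    V, B = b1 + b2, b1 * b2 - kT ** 2 * s1 * s2
    root = math.sqrt(V * V - 4.0 * B)
    r1, r2 = -0.5 * (V - root), -0.5 * (V + root)
    C3 = ((b2 * s1 + s1 * s2 * kT) * (P + kC) + s1 * s2 * kT * P_env) / B
    drive = s1 * (P + kC + kT * T_env0)
    C1 = (r2 * C3 + drive - (b1 + r2) * T_cpu0) / (r1 - r2)
    C2 = (r1 * C3 + drive - (b1 + r1) * T_cpu0) / (r2 - r1)
    e1, e2 = math.exp(r1 * dt), math.exp(r2 * dt)
    T_cpu = C1 * e1 + C2 * e2 + C3
    dT_cpu = C1 * r1 * e1 + C2 * r2 * e2
    T_env = (dT_cpu + b1 * T_cpu - s1 * (P + kC)) / (kT * s1)  # Equation (31)
    return T_cpu, T_env

def simulate_pwm(T_cpu0, T_env0, P_act, P_inc, P_env, theta, period, n_periods, p):
    """Temperatures after n_periods resource periods, active phase first in each period."""
    T_cpu, T_env = T_cpu0, T_env0
    for _ in range(n_periods):
        T_cpu, T_env = step(T_cpu, T_env, P_act, P_env, theta, p)
        T_cpu, T_env = step(T_cpu, T_env, P_inc, P_env, period - theta, p)
    return T_cpu, T_env

# Hypothetical constants: (beta1, beta2, k_T, k_C, sigma1, sigma2).
params = (0.5, 0.05, 0.1, 0.5, 0.2, 0.02)
pwm_result = simulate_pwm(40.0, 30.0, P_act=10.0, P_inc=2.0, P_env=1.0,
                          theta=0.3, period=1.0, n_periods=5, p=params)
```

When Pact equals Pinc the loop reduces to a single constant-power interval, so stepping period by period should agree with one step over the whole horizon; this gives a simple check that the boundary constants are chained correctly.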

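We now use the same approach to derive the environment temperature Tenv as follows,

$$\begin{aligned}
T^{act}_{env}(n\Pi+\Theta) = {}&\frac{1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)r_1e^{r_1\Theta} + C_{2act}(n\Pi)r_2e^{r_2\Theta} - \sigma_1(P_{act} + k_C)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)e^{r_1\Theta} + C_{2act}(n\Pi)e^{r_2\Theta} + C_{3act}(n\Pi)\Big),
\end{aligned} \tag{45}$$

and the temperature at the boundary,

$$\begin{aligned}
T^{act}_{env}(n\Pi) = {}&\frac{1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)r_1 + C_{2act}(n\Pi)r_2 - \sigma_1(P_{act} + k_C)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1act}(n\Pi) + C_{2act}(n\Pi) + C_{3act}(n\Pi)\Big).
\end{aligned} \tag{46}$$

From Equations (45) and (46) we get,

$$\begin{aligned}
T^{act}_{env}(n\Pi+\Theta) = T^{act}_{env}(n\Pi) &+ \frac{1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)r_1(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)r_2(e^{r_2\Theta} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)(e^{r_2\Theta} - 1)\Big).
\end{aligned} \tag{47}$$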

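Similarly, for the inactive period, we can derive the following equation,

$$\begin{aligned}
T^{inc}_{env}((n+1)\Pi) = {}&\frac{1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)r_1e^{r_1(\Pi-\Theta)} + C_{2inc}(n\Pi+\Theta)r_2e^{r_2(\Pi-\Theta)} - \sigma_1(P_{inc} + k_C)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)e^{r_1(\Pi-\Theta)} + C_{2inc}(n\Pi+\Theta)e^{r_2(\Pi-\Theta)} + C_{3inc}(n\Pi+\Theta)\Big).
\end{aligned} \tag{48}$$

Also, considering the temperature of the environment at the boundary nΠ+Θ between the active and inactive periods, we get the following equation,

$$\begin{aligned}
T^{act}_{env}(n\Pi+\Theta) = {}&\frac{1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)r_1 + C_{2inc}(n\Pi+\Theta)r_2 - \sigma_1(P_{inc} + k_C)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta) + C_{2inc}(n\Pi+\Theta) + C_{3inc}(n\Pi+\Theta)\Big).
\end{aligned} \tag{49}$$

From Equations (48) and (49), we get,

$$\begin{aligned}
T^{inc}_{env}((n+1)\Pi) = T^{act}_{env}(n\Pi+\Theta) &+ \frac{1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)r_1(e^{r_1(\Pi-\Theta)} - 1) + C_{2inc}(n\Pi+\Theta)r_2(e^{r_2(\Pi-\Theta)} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + C_{2inc}(n\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1)\Big).
\end{aligned} \tag{50}$$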

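Therefore, we get $T^{inc}_{env}((n+1)\Pi)$ in terms of the constants C and $T^{act}_{env}(n\Pi)$ as follows,

$$\begin{aligned}
T^{inc}_{env}((n+1)\Pi) = T^{act}_{env}(n\Pi) &+ \frac{1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)r_1(e^{r_1(\Pi-\Theta)} - 1) + C_{2inc}(n\Pi+\Theta)r_2(e^{r_2(\Pi-\Theta)} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1inc}(n\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + C_{2inc}(n\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1)\Big)\\
&+ \frac{1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)r_1(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)r_2(e^{r_2\Theta} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1act}(n\Pi)(e^{r_1\Theta} - 1) + C_{2act}(n\Pi)(e^{r_2\Theta} - 1)\Big).
\end{aligned} \tag{51}$$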

Therefore, we can derive the equation for the period (nΠ, (n+ς)Π] as follows,

$$\begin{aligned}
T^{inc}_{env}((n+\varsigma)\Pi) = T^{act}_{env}(n\Pi) &+ \sum_{i=0}^{\varsigma-1} C_{1inc}((n+i)\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1)\frac{r_1+\beta_1}{k_T\sigma_1} + \sum_{i=0}^{\varsigma-1} C_{2inc}((n+i)\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1)\frac{r_2+\beta_1}{k_T\sigma_1}\\
&+ \sum_{i=0}^{\varsigma-1} C_{1act}((n+i)\Pi)(e^{r_1\Theta} - 1)\frac{r_1+\beta_1}{k_T\sigma_1} + \sum_{i=0}^{\varsigma-1} C_{2act}((n+i)\Pi)(e^{r_2\Theta} - 1)\frac{r_2+\beta_1}{k_T\sigma_1}\\
= T^{act}_{env}(n\Pi) &+ \sum_{i=0}^{\varsigma-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((n+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1} + C_{(j)act}((n+i)\Pi)(e^{r_{(j)}\Theta} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1}\Big).
\end{aligned} \tag{52}$$

In Equation (52), the constants for the boundary conditions can be derived from Equation (37).

The CPU and environment temperature equations (Equations (44) and (52)) give a way to calculate the temperature states of the system, provided that we know a single boundary condition. We therefore use Equations (44) and (52) when we calculate the thermal resiliency and the PWM error for the second-order thermal model.

C. Thermal Resiliency Calculation Details

Consider any sampling instant ζ. If the system has reached stability at that time and the absolute CPU temperature does not change with time, then we can calculate Tcpu and Tenv using Equations (44) and (52). Recall that in the steady state Tair(ζκΠ) = Tair((ζ + 1)κΠ) = . . . and Θ(ζκΠ) = Θ((ζ + 1)κΠ) = . . .. Also, we assume that the system has no steady-state error; thus, Equation (14) holds. (Note that it can be shown that if the system is stable with respect to the absolute CPU temperature, then Tcpu and Tenv also reach a steady state and do not fluctuate.) Using the results shown in Equation (14), we can calculate the thermal resiliency.

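We first calculate Tcpu as follows,

$$T_{cpu}((\zeta\kappa+\kappa)\Pi) = T_{cpu}(\zeta\kappa\Pi) + \sum_{i=0}^{\kappa-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((\zeta\kappa+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1) + C_{(j)act}((\zeta\kappa+i)\Pi)(e^{r_{(j)}\Theta} - 1)\Big). \tag{53}$$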

According to the definition of stability, Tcpu(ζκΠ) should not change within the stability region; therefore, Tcpu((ζκ + κ)Π) and Tcpu(ζκΠ) should be equal. Further, considering the physical properties of the CPU cooling process, if the CPU temperature neither increases nor decreases over a single sampling period, then the CPU must return to the same temperature at the end of every resource period Π. Otherwise, the absolute temperature would not have converged to the steady state since, according to the temperature equations, an increase in Tcpu or Tenv results in an increase in temperature at successive stages for the same value of Θ. We can therefore formulate the following equation for two adjacent resource periods,

$$T^{inc}_{cpu}((\zeta\kappa+1)\Pi) = T^{act}_{cpu}(\zeta\kappa\Pi) + \sum_{j=1}^{2}\Big(C_{(j)inc}(\zeta\kappa\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1) + C_{(j)act}(\zeta\kappa\Pi)(e^{r_{(j)}\Theta} - 1)\Big), \tag{54}$$

and conclude,

$$\sum_{j=1}^{2}\Big(C_{(j)inc}(\zeta\kappa\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1) + C_{(j)act}(\zeta\kappa\Pi)(e^{r_{(j)}\Theta} - 1)\Big) = 0, \tag{55}$$

because $T^{inc}_{cpu}((\zeta\kappa+1)\Pi) = T^{act}_{cpu}(\zeta\kappa\Pi)$ by the above argument. We then further simplify Equation (55) as follows,

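$$\begin{aligned}
\Rightarrow{} &\Big(C_{1inc}(\zeta\kappa\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + C_{1act}(\zeta\kappa\Pi)(e^{r_1\Theta} - 1)\Big)\\
&+ \Big(C_{2inc}(\zeta\kappa\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1) + C_{2act}(\zeta\kappa\Pi)(e^{r_2\Theta} - 1)\Big) = 0,\\
\Rightarrow{} &\Big(G_B + G_4T_{env}(\zeta\kappa\Pi+\Theta) - G_3T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)(e^{r_1(\Pi-\Theta)} - 1)\\
&+ \Big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_1\Theta} - 1)\\
&+ \Big(G_D + G_8T_{env}(\zeta\kappa\Pi+\Theta) - G_7T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)(e^{r_2(\Pi-\Theta)} - 1)\\
&+ \Big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_2\Theta} - 1) = 0,\\
\Rightarrow{} &T_{env}(\zeta\kappa\Pi+\Theta)\Big(G_4(e^{r_1(\Pi-\Theta)} - 1) + G_8(e^{r_2(\Pi-\Theta)} - 1)\Big)\\
&+ T_{cpu}(\zeta\kappa\Pi+\Theta)\Big(-G_3(e^{r_1(\Pi-\Theta)} - 1) - G_7(e^{r_2(\Pi-\Theta)} - 1)\Big)\\
&+ T_{cpu}(\zeta\kappa\Pi)\Big(-G_1(e^{r_1\Theta} - 1) - G_5(e^{r_2\Theta} - 1)\Big)\\
&+ T_{env}(\zeta\kappa\Pi)\Big(G_2(e^{r_1\Theta} - 1) + G_6(e^{r_2\Theta} - 1)\Big)\\
&+ G_A(e^{r_1\Theta} - 1) + G_B(e^{r_1(\Pi-\Theta)} - 1) + G_C(e^{r_2\Theta} - 1) + G_D(e^{r_2(\Pi-\Theta)} - 1) = 0,\\
\Rightarrow{} &T_{env}(\zeta\kappa\Pi+\Theta)P_4(\Theta) + T_{cpu}(\zeta\kappa\Pi+\Theta)P_3(\Theta) + T_{cpu}(\zeta\kappa\Pi)P_1(\Theta) + T_{env}(\zeta\kappa\Pi)P_2(\Theta) + P_A(\Theta) = 0,
\end{aligned} \tag{56}$$

where,

$$\begin{aligned}
P_4(\Theta) &= G_4(e^{r_1(\Pi-\Theta)} - 1) + G_8(e^{r_2(\Pi-\Theta)} - 1),\\
P_3(\Theta) &= -G_3(e^{r_1(\Pi-\Theta)} - 1) - G_7(e^{r_2(\Pi-\Theta)} - 1),\\
P_1(\Theta) &= -G_1(e^{r_1\Theta} - 1) - G_5(e^{r_2\Theta} - 1),\\
P_2(\Theta) &= G_2(e^{r_1\Theta} - 1) + G_6(e^{r_2\Theta} - 1),\\
P_A(\Theta) &= G_A(e^{r_1\Theta} - 1) + G_B(e^{r_1(\Pi-\Theta)} - 1) + G_C(e^{r_2\Theta} - 1) + G_D(e^{r_2(\Pi-\Theta)} - 1).
\end{aligned} \tag{57}$$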

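In Equation (56), we use the definitions of C at the time instants ζκΠ and (ζκΠ+Θ), as shown below,

$$\begin{aligned}
C_{1act}(\zeta\kappa\Pi) &= \frac{1}{r_1 - r_2}\Big(r_2C_{3act}(\zeta\kappa\Pi) + \sigma_1\big(P_{act} + k_C + k_TT_{env}(\zeta\kappa\Pi)\big) - (\beta_1 + r_2)T_{cpu}(\zeta\kappa\Pi)\Big)\\
&= \frac{1}{r_1 - r_2}\Big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\Big),
\end{aligned} \tag{58}$$

$$\begin{aligned}
C_{1inc}(\zeta\kappa\Pi+\Theta) &= \frac{1}{r_1 - r_2}\Big(r_2C_{3inc}(\zeta\kappa\Pi+\Theta) + \sigma_1\big(P_{inc} + k_C + k_TT_{env}(\zeta\kappa\Pi+\Theta)\big) - (\beta_1 + r_2)T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)\\
&= \frac{1}{r_1 - r_2}\Big(G_B + G_4T_{env}(\zeta\kappa\Pi+\Theta) - G_3T_{cpu}(\zeta\kappa\Pi+\Theta)\Big),
\end{aligned} \tag{59}$$

$$\begin{aligned}
C_{2act}(\zeta\kappa\Pi) &= \frac{1}{r_2 - r_1}\Big(r_1C_{3act}(\zeta\kappa\Pi) + \sigma_1\big(P_{act} + k_C + k_TT_{env}(\zeta\kappa\Pi)\big) - (\beta_1 + r_1)T_{cpu}(\zeta\kappa\Pi)\Big)\\
&= \frac{1}{r_2 - r_1}\Big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\Big),
\end{aligned} \tag{60}$$

$$\begin{aligned}
C_{2inc}(\zeta\kappa\Pi+\Theta) &= \frac{1}{r_2 - r_1}\Big(r_1C_{3inc}(\zeta\kappa\Pi+\Theta) + \sigma_1\big(P_{inc} + k_C + k_TT_{env}(\zeta\kappa\Pi+\Theta)\big) - (\beta_1 + r_1)T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)\\
&= \frac{1}{r_2 - r_1}\Big(G_D + G_8T_{env}(\zeta\kappa\Pi+\Theta) - G_7T_{cpu}(\zeta\kappa\Pi+\Theta)\Big),
\end{aligned} \tag{61}$$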

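where,

$$\begin{aligned}
G_A &= \frac{1}{r_1 - r_2}\Big(r_2C_{3act}(\zeta\kappa\Pi) + \sigma_1(P_{act} + k_C)\Big), &
G_2 &= \frac{1}{r_1 - r_2}(\sigma_1k_T), &
G_1 &= \frac{1}{r_1 - r_2}(-\beta_1 - r_2),\\
G_B &= \frac{1}{r_1 - r_2}\Big(r_2C_{3inc}(\zeta\kappa\Pi+\Theta) + \sigma_1(P_{inc} + k_C)\Big), &
G_4 &= \frac{1}{r_1 - r_2}(\sigma_1k_T), &
G_3 &= \frac{1}{r_1 - r_2}(-\beta_1 - r_2),\\
G_C &= \frac{1}{r_2 - r_1}\Big(r_1C_{3act}(\zeta\kappa\Pi) + \sigma_1(P_{act} + k_C)\Big), &
G_6 &= \frac{1}{r_1 - r_2}(\sigma_1k_T), &
G_5 &= \frac{1}{r_1 - r_2}(-\beta_1 - r_2),\\
G_D &= \frac{1}{r_2 - r_1}\Big(r_1C_{3inc}(\zeta\kappa\Pi+\Theta) + \sigma_1(P_{inc} + k_C)\Big), &
G_8 &= \frac{1}{r_1 - r_2}(\sigma_1k_T), &
G_7 &= \frac{1}{r_1 - r_2}(-\beta_1 - r_2).
\end{aligned}$$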

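Similarly, from Equation (52), we can show that,

$$T_{env}((\zeta\kappa+\kappa)\Pi) = T_{env}(\zeta\kappa\Pi) + \sum_{i=0}^{\kappa-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((\zeta\kappa+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1} + C_{(j)act}((\zeta\kappa+i)\Pi)(e^{r_{(j)}\Theta} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1}\Big). \tag{62}$$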

Following a stability argument similar to the Tcpu case, $T^{act}_{env}(\zeta\kappa\Pi)$ should not change within the stability region; therefore, $T^{inc}_{env}((\zeta\kappa+\kappa)\Pi)$ and $T^{inc}_{env}(\zeta\kappa\Pi)$ should be equal. Further, considering the physical properties of the environmental heating and cooling process, during stability the environment should return to the same temperature at every resource period Π boundary. Otherwise, the absolute temperature would not have converged to the steady state since, according to the temperature equations, an increase in Tcpu or Tenv results in an increase in temperature at successive stages for the same value of Θ.

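Therefore,

$$T_{env}((\zeta\kappa+1)\Pi) = T_{env}(\zeta\kappa\Pi) + \sum_{j=1}^{2}\Big(C_{(j)inc}(\zeta\kappa\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1} + C_{(j)act}(\zeta\kappa\Pi)(e^{r_{(j)}\Theta} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1}\Big), \tag{63}$$

and

$$\sum_{j=1}^{2}\Big(C_{(j)inc}(\zeta\kappa\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1} + C_{(j)act}(\zeta\kappa\Pi)(e^{r_{(j)}\Theta} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1}\Big) = 0, \tag{64}$$

because $T^{inc}_{env}((\zeta\kappa+1)\Pi) = T^{act}_{env}(\zeta\kappa\Pi)$ by the above argument.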

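We now further simplify Equation (64) using the definitions of C at the time instants ζκΠ and (ζκΠ+Θ), as shown below,

$$\begin{aligned}
\Rightarrow{} &\Big(C_{1inc}(\zeta\kappa\Pi+\Theta)(e^{r_1(\Pi-\Theta)} - 1) + C_{1act}(\zeta\kappa\Pi)(e^{r_1\Theta} - 1)\Big)\tfrac{r_1+\beta_1}{k_T\sigma_1}\\
&+ \Big(C_{2inc}(\zeta\kappa\Pi+\Theta)(e^{r_2(\Pi-\Theta)} - 1) + C_{2act}(\zeta\kappa\Pi)(e^{r_2\Theta} - 1)\Big)\tfrac{r_2+\beta_1}{k_T\sigma_1} = 0,\\
\Rightarrow{} &\Big(G_B + G_4T_{env}(\zeta\kappa\Pi+\Theta) - G_3T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1}\\
&+ \Big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1}\\
&+ \Big(G_D + G_8T_{env}(\zeta\kappa\Pi+\Theta) - G_7T_{cpu}(\zeta\kappa\Pi+\Theta)\Big)(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}\\
&+ \Big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1} = 0,\\
\Rightarrow{} &T_{env}(\zeta\kappa\Pi+\Theta)\Big(G_4(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_8(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}\Big)\\
&+ T_{cpu}(\zeta\kappa\Pi+\Theta)\Big(-G_3(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} - G_7(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}\Big)\\
&+ T_{cpu}(\zeta\kappa\Pi)\Big(-G_1(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} - G_5(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}\Big)\\
&+ T_{env}(\zeta\kappa\Pi)\Big(G_2(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_6(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}\Big)\\
&+ G_A(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_B(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1}\\
&+ G_C(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1} + G_D(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1} = 0,\\
\Rightarrow{} &T_{env}(\zeta\kappa\Pi+\Theta)J_4(\Theta) + T_{cpu}(\zeta\kappa\Pi+\Theta)J_3(\Theta) + T_{cpu}(\zeta\kappa\Pi)J_1(\Theta) + T_{env}(\zeta\kappa\Pi)J_2(\Theta) + J_A(\Theta) = 0,
\end{aligned} \tag{65}$$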

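where,

$$\begin{aligned}
J_4(\Theta) &= G_4(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_8(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1},\\
J_3(\Theta) &= -G_3(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} - G_7(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1},\\
J_1(\Theta) &= -G_1(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} - G_5(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1},\\
J_2(\Theta) &= G_2(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_6(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1},\\
J_A(\Theta) &= G_A(e^{r_1\Theta} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_B(e^{r_1(\Pi-\Theta)} - 1)\tfrac{r_1+\beta_1}{k_T\sigma_1} + G_C(e^{r_2\Theta} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1} + G_D(e^{r_2(\Pi-\Theta)} - 1)\tfrac{r_2+\beta_1}{k_T\sigma_1}.
\end{aligned}$$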

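Further, we consider the CPU temperature over (ζκΠ, ζκΠ+Θ] within the stability region and find the following relationship from Equation (40),

$$T^{act}_{cpu}(\zeta\kappa\Pi+\Theta) = T^{act}_{cpu}(\zeta\kappa\Pi) + C_{1act}(\zeta\kappa\Pi)(e^{r_1\Theta} - 1) + C_{2act}(\zeta\kappa\Pi)(e^{r_2\Theta} - 1). \tag{66}$$

Substituting values for the constants from Equations (58) and (60), we get,

$$\begin{aligned}
T^{act}_{cpu}(\zeta\kappa\Pi+\Theta) = {}&T^{act}_{cpu}(\zeta\kappa\Pi) + \Big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_1\Theta} - 1)\\
&+ \Big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\Big)(e^{r_2\Theta} - 1)\\
\Rightarrow T^{act}_{cpu}(\zeta\kappa\Pi+\Theta) = {}&T^{act}_{cpu}(\zeta\kappa\Pi)\Big(1 - (e^{r_1\Theta} - 1)G_1 - (e^{r_2\Theta} - 1)G_5\Big)\\
&+ T_{env}(\zeta\kappa\Pi)\Big((e^{r_2\Theta} - 1)G_6 + (e^{r_1\Theta} - 1)G_2\Big) + (e^{r_1\Theta} - 1)G_A + (e^{r_2\Theta} - 1)G_C\\
\Rightarrow T_{cpu}(\zeta\kappa\Pi+\Theta) = {}&T^{act}_{cpu}(\zeta\kappa\Pi)P_7(\Theta) + T_{env}(\zeta\kappa\Pi)P_8(\Theta) + P_9(\Theta),
\end{aligned} \tag{67}$$

where,

$$\begin{aligned}
P_7(\Theta) &= 1 - (e^{r_1\Theta} - 1)G_1 - (e^{r_2\Theta} - 1)G_5,\\
P_8(\Theta) &= (e^{r_2\Theta} - 1)G_6 + (e^{r_1\Theta} - 1)G_2,\\
P_9(\Theta) &= (e^{r_1\Theta} - 1)G_A + (e^{r_2\Theta} - 1)G_C.
\end{aligned} \tag{68}$$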

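Also, considering the thermal behavior of the environment from Equation (47), we get,

$$\begin{aligned}
T^{act}_{env}(\zeta\kappa\Pi+\Theta) = T^{act}_{env}(\zeta\kappa\Pi) &+ \frac{1}{k_T\sigma_1}\Big(C_{1act}(\zeta\kappa\Pi)r_1(e^{r_1\Theta} - 1) + C_{2act}(\zeta\kappa\Pi)r_2(e^{r_2\Theta} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1act}(\zeta\kappa\Pi)(e^{r_1\Theta} - 1) + C_{2act}(\zeta\kappa\Pi)(e^{r_2\Theta} - 1)\Big).
\end{aligned} \tag{69}$$

Substituting values for the constants from Equations (58) and (60), we get,

$$\begin{aligned}
T^{act}_{env}(\zeta\kappa\Pi+\Theta) = {}&T^{act}_{env}(\zeta\kappa\Pi)\\
&+ \frac{1}{k_T\sigma_1}\Big(\big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\big)r_1(e^{r_1\Theta} - 1) + \big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\big)r_2(e^{r_2\Theta} - 1)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(\big(G_A + G_2T_{env}(\zeta\kappa\Pi) - G_1T_{cpu}(\zeta\kappa\Pi)\big)(e^{r_1\Theta} - 1) + \big(G_C + G_6T_{env}(\zeta\kappa\Pi) - G_5T_{cpu}(\zeta\kappa\Pi)\big)(e^{r_2\Theta} - 1)\Big),\\
\Rightarrow T^{act}_{env}(\zeta\kappa\Pi+\Theta) = {}&T_{cpu}(\zeta\kappa\Pi)\Big(-\tfrac{r_1+\beta_1}{k_T\sigma_1}G_1(e^{r_1\Theta} - 1) - \tfrac{r_2+\beta_1}{k_T\sigma_1}G_5(e^{r_2\Theta} - 1)\Big)\\
&+ T_{env}(\zeta\kappa\Pi)\Big(1 + \tfrac{r_1+\beta_1}{k_T\sigma_1}G_2(e^{r_1\Theta} - 1) + \tfrac{r_2+\beta_1}{k_T\sigma_1}G_6(e^{r_2\Theta} - 1)\Big)\\
&+ \Big(\tfrac{r_1+\beta_1}{k_T\sigma_1}G_A(e^{r_1\Theta} - 1) + \tfrac{r_2+\beta_1}{k_T\sigma_1}G_C(e^{r_2\Theta} - 1)\Big),\\
\Rightarrow T^{act}_{env}(\zeta\kappa\Pi+\Theta) = {}&T_{cpu}(\zeta\kappa\Pi)P_{10}(\Theta) + T_{env}(\zeta\kappa\Pi)P_{11}(\Theta) + P_{12}(\Theta),
\end{aligned} \tag{70}$$

where,

$$\begin{aligned}
P_{10}(\Theta) &= -\tfrac{r_1+\beta_1}{k_T\sigma_1}G_1(e^{r_1\Theta} - 1) - \tfrac{r_2+\beta_1}{k_T\sigma_1}G_5(e^{r_2\Theta} - 1),\\
P_{11}(\Theta) &= 1 + \tfrac{r_1+\beta_1}{k_T\sigma_1}G_2(e^{r_1\Theta} - 1) + \tfrac{r_2+\beta_1}{k_T\sigma_1}G_6(e^{r_2\Theta} - 1),\\
P_{12}(\Theta) &= \tfrac{r_1+\beta_1}{k_T\sigma_1}G_A(e^{r_1\Theta} - 1) + \tfrac{r_2+\beta_1}{k_T\sigma_1}G_C(e^{r_2\Theta} - 1).
\end{aligned} \tag{71}$$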

Therefore, applying Equations (56), (65), (67), and (70) in Equation (14), we get the thermal resiliency as follows,

$$T_{air} = T_{ref} - T_{cpu}(\zeta\kappa\Pi) - T_{env}(\zeta\kappa\Pi) = T_{ref} - \frac{E_1(\Theta)}{E_N(\Theta)} - \frac{E_2(\Theta)}{E_N(\Theta)}.$$

We may finally express our thermal-resiliency function in terms of the fixed thermal constants and the inputs Tref and Θ(i) (which comes from the input mode Mi),

$$\Lambda(M_i, T_{ref}) = T_{ref} - \frac{E_1(\Theta^{(i)})}{E_N(\Theta^{(i)})} - \frac{E_2(\Theta^{(i)})}{E_N(\Theta^{(i)})}, \tag{72}$$

where,

E1(Θ) = JA(Θ)P2(Θ) + J4(Θ)P12(Θ)P2(Θ)

+ JA(Θ)P11(Θ)P4(Θ)− J2(Θ)P12(Θ)P4(Θ)

+ JA(Θ)P3(Θ)P8(Θ) + J4(Θ)P12(Θ)P3(Θ)P8(Θ)

− J3(Θ)P12(Θ)P4(Θ)P8(Θ) + J3(Θ)P2(Θ)P9(Θ)

− J2(Θ)P3(Θ)P9(Θ)− J4(Θ)P11(Θ)P3(Θ)P9(Θ)

+ J3(Θ)P11(Θ)P4(Θ)P9(Θ)− J2(Θ)PA(Θ)

− J4(Θ)P11(Θ)PA(Θ)− J3(Θ)P8(Θ)PA(Θ),

E2(Θ) = −JA(Θ)P1(Θ)− J4(Θ)P1(Θ)P12(Θ)

− JA(Θ)P10(Θ)P4(Θ) + J1(Θ)P12(Θ)P4(Θ)

− JA(Θ)P3(Θ)P7(Θ)− J4(Θ)P12(Θ)P3(Θ)P7(Θ)

+ J3(Θ)P12(Θ)P4(Θ)P7(Θ)− J3(Θ)P1(Θ)P9(Θ)

+ J1(Θ)P3(Θ)P9(Θ) + J4(Θ)P10(Θ)P3(Θ)P9(Θ)

− J3(Θ)P10(Θ)P4(Θ)P9(Θ) + J1(Θ)PA(Θ)

+ J4(Θ)P10(Θ)PA(Θ) + J3(Θ)P7(Θ)PA(Θ),

EN (Θ) = J2(Θ)P1(Θ) + J4(Θ)P1(Θ)P11(Θ)

− J1(Θ)P2(Θ)− J4(Θ)P10(Θ)P2(Θ)

+ J2(Θ)P10(Θ)P4(Θ)− J1(Θ)P11(Θ)P4(Θ)

− J3(Θ)P2(Θ)P7(Θ) + J2(Θ)P3(Θ)P7(Θ)

+ J4(Θ)P11(Θ)P3(Θ)P7(Θ)

− J3(Θ)P11(Θ)P4(Θ)P7(Θ)

+ J3(Θ)P1(Θ)P8(Θ)− J1(Θ)P3(Θ)P8(Θ)

− J4(Θ)P10(Θ)P3(Θ)P8(Θ)

+ J3(Θ)P10(Θ)P4(Θ)P8(Θ).

(73)

D. PWM Error Derivation

We first derive the PWM error when the thermal model does not include the leakage-current effect. We derive a bound on this error by comparing the resulting temperature at the (k+1)'th sample of both the continuous power mode controller and the PWM-based controller, assuming that the temperature was the same at the k'th sample. In both cases, we assume the same design parameters, such as Ts, Ko, and γI. Let $T^{cont}_{cpu}$, $T^{cont}_{env}$, $T^{pwm}_{cpu}$, and $T^{pwm}_{env}$ be the temperature functions (CPU and environment) for the continuous power mode controller and the PWM-based controller, respectively. We first derive the resulting relative temperatures for the continuous power mode controller at the (k + 1)'th sample, i.e., $T^{cont}_{cpu}(k+1)$ and $T^{cont}_{env}(k+1)$. Note that we represent the temperatures at the k'th sample by Tcpu(k) and Tenv(k), since the controllers' state is assumed to be equal at that sample. We also denote the final CPU and environment temperatures calculated by means of the PWM regulation method by Tcpu(Ts) and Tenv(Ts), respectively.

$$T^{cont}_{cpu}(k+1) = \frac{P_{cpu}(k)}{C_{cpu}}\,\frac{(1 - e^{-T_s\beta_1})}{\beta_1} + T_{cpu}(k)e^{-T_s\beta_1}, \tag{74}$$

$$T^{cont}_{env}(k+1) = \left(\frac{P_{env}(k) + P_{cpu}(k)}{C_{env}}\right)\frac{(1 - e^{-T_s\beta_2})}{\beta_2} + T_{env}(k)e^{-T_s\beta_2}. \tag{75}$$

Similarly, we may derive the temperatures for the PWM-based controller, noting that Θ(k) may be obtained from the relationship in Equation (13).

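$$\begin{aligned}
T^{pwm}_{cpu}(k+1) &= \sum_{i=1}^{\kappa}\left(\int_{(i-1)\Pi+\Theta(k)}^{i\Pi}\frac{P_{inc}}{C_{cpu}}e^{-(\kappa\Pi-u)\beta_1}\,du + \int_{(i-1)\Pi}^{(i-1)\Pi+\Theta(k)}\frac{P_{act}}{C_{cpu}}e^{-(\kappa\Pi-u)\beta_1}\,du\right) + T_{cpu}(k)e^{-\beta_1\kappa\Pi}\\
&= \frac{(1 - e^{-\kappa\Pi\beta_1})}{(1 - e^{-\Pi\beta_1})C_{cpu}\beta_1}\Big((1 - e^{-(\Pi-\Theta(k))\beta_1})P_{inc} + e^{-(\Pi-\Theta(k))\beta_1}(1 - e^{-\beta_1\Theta(k)})P_{act}\Big) + T_{cpu}(k)e^{-\beta_1\kappa\Pi}
\end{aligned} \tag{76}$$

and

$$\begin{aligned}
T^{pwm}_{env}(k+1) &= \sum_{i=1}^{\kappa}\left(\int_{(i-1)\Pi+\Theta(k)}^{i\Pi}\frac{(P_{inc} + P_{env}(k))}{C_{env}}e^{-(\kappa\Pi-u)\beta_2}\,du + \int_{(i-1)\Pi}^{(i-1)\Pi+\Theta(k)}\frac{(P_{act} + P_{env}(k))}{C_{env}}e^{-(\kappa\Pi-u)\beta_2}\,du\right) + T_{env}(k)e^{-\beta_2\kappa\Pi}\\
&= \frac{(1 - e^{-\kappa\Pi\beta_2})}{(1 - e^{-\Pi\beta_2})C_{env}\beta_2}\Big((1 - e^{-(\Pi-\Theta(k))\beta_2})(P_{inc} + P_{env}(k)) + e^{-(\Pi-\Theta(k))\beta_2}(1 - e^{-\beta_2\Theta(k)})(P_{act} + P_{env}(k))\Big) + T_{env}(k)e^{-\beta_2\kappa\Pi}.
\end{aligned} \tag{77}$$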

Therefore, the error can be calculated as,

$$T_{cpu}(\kappa\Pi) + T_{env}(\kappa\Pi) - (T_{cpu}(T_s) + T_{env}(T_s)) < \epsilon, \tag{78}$$

where ϵ is the error that we are willing to tolerate. To evaluate Equation (78), we first calculate the equivalent power of the system in terms of Θ and then look for a way to increase the accuracy of this equivalent-power calculation. As κ, the ratio between the sampling period and Π, increases, the error should decrease. However, κ cannot be increased indefinitely because, in a practical implementation, κ determines the PWM switching frequency, and each switching instance incurs switching overhead. Therefore, we select a κ value that balances the tolerable error against the switching overhead.
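The trade-off between κ and the PWM error can be explored numerically. The sketch below covers only the CPU term of Equation (78): it compares the closed forms of Equations (74) and (76) for a fixed sampling period Ts = κΠ and searches for the smallest κ that meets a tolerance. It assumes that the continuous controller applies the equivalent power (Pact − Pinc)Θ/Π + Pinc, as in the definition of Fcont; the numeric values and function names are hypothetical.

```python
import math

def t_cont(T0, P_eq, C, beta, Ts):
    """Equation (74): temperature after one sample with constant equivalent power."""
    decay = math.exp(-Ts * beta)
    return P_eq / C * (1.0 - decay) / beta + T0 * decay

def t_pwm(T0, P_act, P_inc, C, beta, theta, period, kappa):
    """Closed form of Equation (76): kappa PWM periods, active phase first."""
    idle_decay = math.exp(-(period - theta) * beta)
    per_period = ((1.0 - idle_decay) * P_inc
                  + idle_decay * (1.0 - math.exp(-beta * theta)) * P_act)
    scale = (1.0 - math.exp(-kappa * period * beta)) / (
        (1.0 - math.exp(-period * beta)) * C * beta)
    return scale * per_period + T0 * math.exp(-beta * kappa * period)

def cpu_pwm_error(kappa, duty, Ts=10.0, T0=5.0, P_act=10.0, P_inc=2.0, C=1.0, beta=0.1):
    """PWM minus continuous CPU temperature after one sampling period Ts = kappa * period."""
    period = Ts / kappa
    theta = duty * period
    P_eq = (P_act - P_inc) * duty + P_inc  # assumed equivalent power, duty = theta / period
    return (t_pwm(T0, P_act, P_inc, C, beta, theta, period, kappa)
            - t_cont(T0, P_eq, C, beta, Ts))

def smallest_kappa(eps, duty, kappa_max=10_000):
    """Smallest kappa whose absolute CPU error is below eps, or None if none is found."""
    for kappa in range(1, kappa_max + 1):
        if abs(cpu_pwm_error(kappa, duty)) < eps:
            return kappa
    return None

errors = {kappa: cpu_pwm_error(kappa, duty=0.5) for kappa in (1, 5, 50)}
kappa_star = smallest_kappa(eps=0.5, duty=0.5)
```

In practice the chosen κ would also be capped by the switching overhead, which this sketch does not model.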

E. PWM Error Derivation (using Second Order Thermal Model)

When we consider the leakage current in the model, the analysis differs slightly, as shown in this section. We derive a bound on this error by comparing the resulting temperature at the (k + 1)'th sample of both the continuous power mode controller and the PWM-based controller, assuming that the temperature was the same at the k'th sample. In both cases, we assume the same design parameters, such as Ts, Ko, and γI. Let $T^{cont}_{cpu}$, $T^{cont}_{env}$, $T^{pwm}_{cpu}$, and $T^{pwm}_{env}$ be the temperature functions (CPU and environment) for the continuous power mode controller and the PWM-based controller, respectively. We first derive the resulting relative temperatures for the continuous power mode controller at the (k + 1)'th sample, i.e., $T^{cont}_{cpu}(k+1)$ and $T^{cont}_{env}(k+1)$. Note that we represent the temperatures at the k'th sample by Tcpu(k) and Tenv(k), since the controllers' state is assumed to be equal at that sample. We also denote the final CPU and environment temperatures calculated by means of the PWM regulation method by Tcpu(Ts) and Tenv(Ts), respectively.

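Therefore we calculate (see footnote 5),

$$T^{cont}_{cpu}(k+1) = T_{cpu}(k) + C_{1cont}(k)e^{r_1T_s} + C_{2cont}(k)e^{r_2T_s} + C_{3cont}(k), \tag{79}$$

and

$$\begin{aligned}
T^{cont}_{env}(k+1) = T_{env}(k) &+ \frac{1}{k_T\sigma_1}\Big(C_{1cont}(k)r_1e^{r_1T_s} + C_{2cont}(k)r_2e^{r_2T_s} - \sigma_1\Big(\frac{(P_{act} - P_{inc})\Theta(k) + P_{inc}\Pi}{\Pi} + k_C\Big)\Big)\\
&+ \frac{\beta_1}{k_T\sigma_1}\Big(C_{1cont}(k)e^{r_1T_s} + C_{2cont}(k)e^{r_2T_s} + C_{3cont}(k)\Big),
\end{aligned} \tag{80}$$

where $C_{(j)cont}(k)$, j ∈ {1, 2}, is calculated from Equation (37).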

Similarly, we may derive the temperatures for the PWM-based controller, noting that Θ(k) may be obtained from the relationship in Equation (13). When we calculate the temperature using the PWM method, the temperature shift over each period Π needs to be calculated. We then obtain the resultant temperature by adding the accumulated temperature shift to the initial temperature.

Recall that we calculate Tcpu((n + ς)Π) from Equation (44) and Tenv((n + ς)Π) from Equation (52).

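$$T^{inc}_{cpu}((n+\kappa)\Pi) = T^{act}_{cpu}(n\Pi) + \sum_{i=0}^{\kappa-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((n+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1) + C_{(j)act}((n+i)\Pi)(e^{r_{(j)}\Theta} - 1)\Big). \tag{81}$$

$$T^{inc}_{env}((n+\kappa)\Pi) = T^{act}_{env}(n\Pi) + \sum_{i=0}^{\kappa-1}\sum_{j=1}^{2}\Big(C_{(j)inc}((n+i)\Pi+\Theta)(e^{r_{(j)}(\Pi-\Theta)} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1} + C_{(j)act}((n+i)\Pi)(e^{r_{(j)}\Theta} - 1)\frac{r_{(j)}+\beta_1}{k_T\sigma_1}\Big). \tag{82}$$

Footnote 5: $T^{cont}_{cpu}(k)$ is an abuse of notation; it should be $T^{cont}_{cpu}(kT_s)$.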

Therefore, the error can be calculated as,

$$\Gamma(\Theta, \Pi, \kappa, t) \stackrel{def}{=} T_{cpu}(\kappa\Pi) + T_{env}(\kappa\Pi) - (T_{cpu}(T_s) + T_{env}(T_s)) < \epsilon, \tag{83}$$

where ϵ is the error that we are willing to tolerate. To evaluate Equation (83), we first calculate the equivalent power of the system in terms of Θ and then look for a way to increase the accuracy of the Θ calculation. As κ, the ratio between the sampling period and Π, increases, the error should decrease. However, κ cannot be increased indefinitely because, in a practical implementation, κ determines the PWM switching frequency, and each switching instance incurs switching overhead. Therefore, we select a κ value that balances the tolerable error against the switching overhead. Further, if we want to prioritize some other criterion in the error value, we can differentiate Equation (83) and obtain the required condition.

F. Testbench Implementation Details

Our testbed is based on a modified IBM-compatible PC. The system needs two main modifications: first, a hardware modification for temperature measurement (sensor placement), and second, a system-level modification consisting of changes to the kernel and the development of a loadable kernel module as a user-kernel space communication mechanism.

The low-power Intel P4 CPU we use does not have a system developer interface for measuring the on-die temperature. (The Intel documentation states that an on-die sensor is present; however, unlike the latest CPU families, it provides no interface for reading the on-die temperature in software.) Therefore, we follow the procedure given in the Intel documentation [25]. We carefully place a T-type thermocouple on the CPU die through a small penetration made by a precision milling machine, as recommended by Intel. We use a Phidgets 4-port temperature sensor board to measure the environment, air, and on-die temperatures. The Phidgets board comes with a USB driver, which allows us to interface the sensors directly with the testbed software.

We develop a loadable kernel module to activate and change the frequency modulation level. Intel provides a Model Specific Register (MSR) to control the frequency modulation ratio of the clock, and we select the highest and the lowest frequency modulation indices to emulate the low and the higher power levels. We use 12.5% and 87.5% modulation ratios in the IA32_CLOCK_MODULATION MSR for active and inactive power mode emulation. Also, we build our system using the Linux 2.6.33.7.2-rt30 PREEMPT_RT kernel.

Fig. 11: The implementation details of the testbed. (Diagram blocks: thermal sensor driver, frequency modulation driver, controller with inputs Tcpu, Tair, and Tenv, mode selection, O/S schedule, CPU, T-type sensor, and MSR.)
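To make the register settings concrete, the sketch below encodes an on-demand clock modulation duty cycle into an IA32_CLOCK_MODULATION value, using the layout documented in the Intel SDM (register address 0x19A, enable in bit 4, duty cycle in bits 3:1 in steps of 12.5%). The testbed itself uses a loadable kernel module; the user-space write through the Linux `msr` device shown here is only an assumed alternative for illustration, and the function names are hypothetical.

```python
import os
import struct

IA32_CLOCK_MODULATION = 0x19A  # MSR address per the Intel SDM

def clock_modulation_value(duty_percent):
    """Encode a duty cycle (12.5% .. 87.5%, in 12.5% steps) with the enable bit set."""
    steps = round(duty_percent / 12.5)
    if not 1 <= steps <= 7 or abs(steps * 12.5 - duty_percent) > 1e-9:
        raise ValueError("duty cycle must be a multiple of 12.5% between 12.5% and 87.5%")
    return (1 << 4) | (steps << 1)  # bit 4: enable, bits 3:1: duty-cycle field

def write_msr(cpu, register, value):
    """Write a 64-bit MSR through /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), register)
    finally:
        os.close(fd)

low_duty = clock_modulation_value(12.5)
high_duty = clock_modulation_value(87.5)
# Example (not executed here): write_msr(0, IA32_CLOCK_MODULATION, low_duty)
```

Keeping the encoding separate from the privileged write makes the bit layout easy to check without touching the hardware.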

The multi-threaded application we developed is based on the Linux Native POSIX Thread Library (NPTL). Our application consists of two parts: a thread activator and a schedule simulator. The schedule simulator selects jobs from the ready queue according to EDF and dispatches them to the thread activator. The thread activator is a very high priority thread (its priority is set higher than that of the threaded IRQ handlers) that emulates the scheduler tick of the Linux kernel at a higher level of abstraction. Similar to the Linux kernel scheduler tick, the thread activator sleeps until it wakes up precisely at the scheduling boundaries. It wakes up at unequal tick intervals to schedule jobs, raises the thread that should have priority, and goes back to sleep. This process repeats; the amount of time allocated to each job depends on EDF, and the total time depends on the Θ given by the optimal controller.
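The interaction between EDF job selection and the budget Θ can be sketched in a few lines. This is a simplified, hypothetical model of one resource period, not the testbed's thread activator: jobs are (deadline, name, remaining execution time) tuples kept in a heap, and the function spends at most Θ time units on them in earliest-deadline-first order.

```python
import heapq

def run_one_period(ready, theta):
    """Spend a budget of theta time units on the ready heap in EDF order.

    ready: heap of (deadline, name, remaining) tuples, modified in place.
    Returns the list of (name, run_time) slices executed in this period.
    """
    slices = []
    budget = theta
    while ready and budget > 0:
        deadline, name, remaining = heapq.heappop(ready)
        run = min(remaining, budget)
        slices.append((name, run))
        budget -= run
        if remaining - run > 0:
            # The job is not finished: keep it for the next period.
            heapq.heappush(ready, (deadline, name, remaining - run))
    return slices

# Hypothetical job set: (absolute deadline, name, remaining execution time).
ready = [(30, "B", 4), (10, "A", 3), (20, "C", 5)]
heapq.heapify(ready)

first_period = run_one_period(ready, theta=6)
second_period = run_one_period(ready, theta=6)
```

A larger Θ from the controller simply lets more of the ready queue run within each period, while the order of execution stays governed by the deadlines.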

G. Calculation of State-Space Parameters Using Testbed Results

According to our thermal model, the testbed output corresponds to the temperature measurement (Tcpu + Tenv), which can be measured using a T-type thermocouple as explained in Appendix F. However, the testbed input, which corresponds to the equivalent CPU thermal input power, cannot be measured directly. The closest measurable parameter is the CPU input power. Assuming that the electrical power consumed by the CPU is converted entirely to thermal energy, we measure the CPU input power and treat it as the equivalent thermal power (see footnote 6).

Footnote 6: This assumption is realistic because the purpose of the CPU (as of any electrical circuit) is to operate its switches; each gate in those switches consumes energy and generates heat, and there is no other energy transformation in an ordinary electrical circuit.

In order to measure the CPU input power, we measure the power drawn through the 4-pin main-board ATX power connector. We install two small-value shunt resistors in the current path of the ATX power connector and measure the voltage drop across them using a National Instruments data acquisition interface, the NI 9205. The NI 9205 does not provide a USB driver for Linux, which is our operating system. Therefore, we create an application interface on a Windows computer that connects to the testbed over Ethernet. At periodic intervals, the testbed measures the temperatures of the CPU and the environment and informs our NI 9205 interface to record the ATX current readings. We calculate the total power fed to the CPU, since the current drawn by the CPU (from the NI measurements) and the voltage of the 4-wire ATX interface are known.
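The power calculation described above amounts to Ohm's law per shunt followed by P = V·I. The sketch below illustrates it under stated assumptions: the rail voltage, shunt resistance, and measured voltage drops are hypothetical example values (not testbed data), and the small voltage lost across the shunts is neglected.

```python
def cpu_input_power(rail_voltage, shunt_drops, shunt_resistance):
    """Total power through the ATX connector from the voltage drops across the shunts.

    rail_voltage: supply voltage of the connector in volts.
    shunt_drops: measured voltage drop across each shunt in volts (one per channel).
    shunt_resistance: resistance of each shunt in ohms.
    """
    currents = [drop / shunt_resistance for drop in shunt_drops]  # I = V_drop / R
    return sum(rail_voltage * current for current in currents)    # P = V * I per channel

# Hypothetical readings: two channels, 5 milliohm shunts, 12 V rail.
power_watts = cpu_input_power(rail_voltage=12.0,
                              shunt_drops=[0.010, 0.012],
                              shunt_resistance=0.005)
```

With these example readings the two channels carry 2.0 A and 2.4 A, giving 52.8 W in total.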

We select a random workload that is sufficient to generate the required thermal effect on the CPU. We run the testbed experiments for a long time period to generate an extensive amount of input-output data. The generated testbed data is then used to derive the state-space parameters using standard tools provided by the System Identification Toolbox in Matlab. In particular, we use the Matlab implementation of the Prediction Error Method (PEM) algorithm. Once we generate the state-space parameters, we use them in the rest of the simulations, such as resiliency prediction and controller design.

We collect two sets of data from the testbed: one set is used to generate the model parameters, and the other is used to validate them. As is standard in system identification, the required accuracy of the model parameters depends on the application. In some applications the parameters need to be very accurate; in many control applications, however, such as systems with large time constants, the simplicity of the model may matter more for the implementation than the accuracy of the parameters.

We observe that, during the SI process, the thermal output of the CPU is not sufficient to produce an accurately measurable temperature difference in the environment. Therefore, for parameter generation, we use a first-order CPU thermal model, assume that the environmental temperature stays stable, and treat the thermal model of the system as a differential model. In other words, the leakage power of the testbed is constant for a given temperature; therefore, when we consider the differential model (the difference between a steady point and the current point), the leakage power component need not be considered for nearby operating points.

When the environment temperature is nearly stable over a sufficiently long time period, we obtain a normalized thermal model of the CPU as follows,

$$\frac{d}{dt}T_{cpu}(t) = \sigma_1\left(k_T - \frac{1}{R^l_{cpu}} - \frac{1}{R^d_{cpu}}\right)T_{cpu}(t) + k_T\sigma_1T_{env}(t) + \sigma_1P^d_{cpu}(t) + \sigma_1k_C. \tag{84}$$

Considering another test point (at tE) during our SI process and assuming the same environmental temperature, we find,

$$\frac{d}{dt}T_{cpu}(t_E) = \sigma_1\left(k_T - \frac{1}{R^l_{cpu}} - \frac{1}{R^d_{cpu}}\right)T_{cpu}(t_E) + k_T\sigma_1T_{env}(t_E) + \sigma_1P^d_{cpu}(t_E) + \sigma_1k_C. \tag{85}$$

Fig. 12: The voltage variation across the shunt resistors for a random workload in our testbed, at high and low frequency modulation values during the SI process, as shown by NI LabVIEW. The prominent high and low voltage values (square wave) correspond to the high and low frequency modulation coefficients, and the short voltage spikes within the square wave correspond to the software workload variation. The white and red voltage traces correspond to the two ATX channels.

This gives us the ability to model the system as a differential system; therefore, the final system that we use in the controller design and the parameter generation may be written as,

$$\frac{d}{dt}\tilde{T}_{cpu}(t) = A\,\tilde{T}_{cpu}(t) + B\,\tilde{P}^d_{cpu}(t), \tag{86}$$

where $\tilde{T}_{cpu}(t) = T_{cpu}(t) - T_{cpu}(t_E)$ and $\tilde{P}^d_{cpu}(t) = P^d_{cpu}(t) - P^d_{cpu}(t_E)$.

In our parameter generation process, we use the discrete form of the state-space Equation (86). As shown earlier, the continuous-time state-space model can be converted to a discrete-time state-space model, and the following discrete model is obtained,

$$\tilde{T}_{cpu}(k+1) = G\,\tilde{T}_{cpu}(k) + H\,\tilde{P}^d_{cpu}(k). \tag{87}$$
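For a scalar first-order model, the conversion from the continuous form to the discrete form under a zero-order hold has a simple closed form: G = e^{A·Ts} and H = (e^{A·Ts} − 1)·B/A. The sketch below illustrates this standard conversion; the numeric values of A, B, and Ts and the function name are hypothetical and are not the identified testbed parameters.

```python
import math

def zoh_discretize(A, B, Ts):
    """Discretize dx/dt = A x + B u with the input held constant over each period Ts."""
    G = math.exp(A * Ts)
    # For A == 0 the integral of the (constant) input gain is simply B * Ts.
    H = (G - 1.0) / A * B if A != 0.0 else B * Ts
    return G, H

# Hypothetical continuous-time parameters (A < 0 for a stable thermal model).
A_c, B_c, Ts = -0.05, 0.2, 1.0
G_d, H_d = zoh_discretize(A_c, B_c, Ts)

# One discrete step from a zero deviation with a constant power deviation of 10.
next_temperature = G_d * 0.0 + H_d * 10.0
```

The discrete step reproduces the continuous-time response exactly at the sampling instants as long as the input really is constant between samples, which is the ZOH assumption used throughout the appendix.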

This parameter generation can be considered a linearization of our model at the operating point (at a particular environment temperature). In future work, we will generate linearized system parameters over a smooth operating region and implement a gain-scheduled controller.
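The fit of the discrete first-order model can be illustrated with ordinary least squares. Note that this is a simpler stand-in for the PEM algorithm in Matlab's System Identification Toolbox that the authors use, not a reimplementation of it. The data are synthetic, generated from hypothetical "true" parameters, and one half is used for estimation and the other for validation, mirroring the two data sets described in this appendix.

```python
import random

def fit_first_order(T, P):
    """Least-squares estimate of (G, H) in T[k+1] = G*T[k] + H*P[k]."""
    n = len(T) - 1
    s_tt = sum(T[k] * T[k] for k in range(n))
    s_tp = sum(T[k] * P[k] for k in range(n))
    s_pp = sum(P[k] * P[k] for k in range(n))
    s_ty = sum(T[k] * T[k + 1] for k in range(n))
    s_py = sum(P[k] * T[k + 1] for k in range(n))
    det = s_tt * s_pp - s_tp * s_tp          # determinant of the normal equations
    G = (s_ty * s_pp - s_py * s_tp) / det
    H = (s_py * s_tt - s_ty * s_tp) / det
    return G, H

def simulate(G, H, P, T0=0.0):
    """Generate the temperature deviations produced by the power deviations P."""
    T = [T0]
    for p in P:
        T.append(G * T[-1] + H * p)
    return T

def max_one_step_error(G, H, T, P):
    """Largest one-step-ahead prediction error on a data set."""
    return max(abs(T[k + 1] - (G * T[k] + H * P[k])) for k in range(len(T) - 1))

# Synthetic, noise-free data from hypothetical true parameters.
rng = random.Random(0)
G_true, H_true = 0.95, 0.2
P_all = [rng.choice([2.0, 10.0]) for _ in range(400)]  # random two-level power input
T_all = simulate(G_true, H_true, P_all)

# First half for estimation, second half for validation.
G_est, H_est = fit_first_order(T_all[:201], P_all[:200])
validation_error = max_one_step_error(G_est, H_est, T_all[200:], P_all[200:])
```

With measured data the estimates would carry noise, and the validation error on the held-out set indicates whether the first-order model is adequate at the chosen operating point.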