2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Changsha, China, November 16-18, 2011

Thermal-Aware Scheduling of Critical Applications Using Job Migration and Power-Gating on Multi-Core Chips

Buyoung Yun, Kang G. Shin
EECS Department, The University of Michigan
Ann Arbor, MI 48109-2121

{buyoung,kgshin}@eecs.umich.edu

Shige Wang
General Motors Global R&D

Warren, MI [email protected]

Abstract: Multi-core System-on-Chip (SoC) has become a popular execution platform for many embedded real-time systems. As CMOS transistors continue to shrink into the nanoscale regime, chips become more susceptible to various reliability threats, mainly due to thermal hotspots. To improve the reliability of embedded real-time systems, dynamic thermal management (DTM) is required for mission/safety-critical applications running on a multi-core chip to avoid possible thermal hazards while meeting the applications' timing constraints.

In this paper, we propose an efficient runtime thermal-aware scheduler (TAS) using job-migration and power-gating techniques to avoid thermal hotspots on a multi-core chip. Before runtime, the TAS distributes the periodic real-time tasks to cores using the tasks' execution profiles to balance the utilization of functional units on the chip. At runtime, the TAS periodically monitors the core temperatures and triggers one of the following pre-defined thermal management schemes, depending on the level of the measured core temperature: (i) migrating jobs running on hot cores to other cooler cores to reduce the workloads on the hot cores, or (ii) turning off hot cores for a certain period of time to cool them down. All tasks' timing constraints are guaranteed during the job migrations and while powering cores on and off. Our in-depth evaluation has shown that the proposed TAS can effectively minimize the thermal hotspots on a multi-core chip without violating any application timing constraint.

I. INTRODUCTION

State-of-the-art CMOS scaling enables multi-core chips to be configured with more processing units on a single die to achieve high performance. This aggressive scaling of CMOS into the nano-regime will, however, increase power densities, generating more hotspots on the chip. High temperature is well known as the main factor that accelerates the degradation of chips in electronic devices. High on-chip temperature further affects the functional and timing correctness of a chip [1], and will eventually trigger irrecoverable failures.

Hence, proper management of power consumption and temperature on a multi-core chip is crucial to reduce the risk of chip failure caused by hotspots and to satisfy the non-functional requirements of embedded real-time systems. Conventional chip temperature management using a hardware cooling system has been shown to be less cost-effective because its cost grows exponentially with the amount of power dissipated on a chip [2]. Thus, the use of multi-core chips to meet non-functional requirements, such as reliability and timeliness, has gained significant interest recently, due mainly to their potential for high performance and reliability at a lower cost.

To address the reliability issue due to high temperature, numerous power and temperature management techniques have been proposed at the hardware and software layers. Thermal-aware floor-planning is one technique used to prevent hotspots on a chip at architecture design time. By estimating the hot micro-blocks and then separating each of them from the rest [3], special treatment is designed and performed during chip design to avoid certain runtime thermal hazards. However, because the applications running on a chip may have different runtime and power dissipation characteristics, floor-planning may not prevent some micro-blocks on the chip from being overheated at runtime due to the runtime dynamics. To handle this problem, a software technique called Dynamic Thermal Management (DTM) has been proposed for power/temperature management. Among various DTMs, dynamic voltage and frequency scaling (DVFS) has been widely used in microprocessors [4], [5]. Nevertheless, its applicability to current multi-core systems has been limited for two reasons. First, it becomes less effective for power management of modern multi-core chips implemented in nano-scale CMOS process technology, because of the increasing leakage power [6]. Second, per-core DVFS can be applied only if the power/clock of each core can be adjusted independently, a feature that is rarely supported in current multi-core architectures. By contrast, job-migration requires no additional hardware implementation [7]. In addition, power-gating is a well-known technique for saving leakage power efficiently by inserting sleep transistors between actual ground and virtual ground [8], [9]. Although power-gating induces overheads such as extra circuits and/or wake-up delay and noise, applying it to modern microprocessors is still beneficial, since a significant amount of leakage power is saved at a lower cost [10].

2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11

978-0-7695-4600-1/11 $26.00 © 2011 IEEE

DOI 10.1109/TrustCom.2011.148

In this paper, we propose a runtime thermal-aware scheduling algorithm using the power-gating mechanism combined with job-migration on a multi-core chip. It is triggered periodically to keep the maximum temperature of each core below a pre-specified threshold while meeting the applications' timing constraints. To control the thermal behaviors of the cores efficiently, we first propose a heuristic task assignment scheme to distribute the workload on a multi-core chip based on the tasks' execution profiles. During runtime, the proposed TAS applies two different DTM schemes based on the measured core temperatures. When the measured core temperature is below, but close to, the temperature threshold, the TAS migrates jobs executing on the hot core to cooler cores to reduce the workload on the hot core. When the temperature is above the threshold, the hot core is turned off and, if needed, the TAS migrates the running jobs to other cores to create sufficient time slack on the hot core for cooling. To guarantee the completion of migrated jobs before their deadlines, schedulability conditions are derived and checked.

The rest of this paper is organized as follows. Section II states the assumptions and the system models. Section III formulates the problem. Section IV presents the key components of the proposed algorithm. Section V evaluates the performance of the algorithm using simulations. Section VI discusses the related work, followed by the concluding remarks in Section VII.

II. SYSTEM MODELS AND PROBLEM STATEMENT

To formulate the problem and develop a runtime thermal-aware scheduling algorithm on a multi-core platform, we define a set of models of a given chip: a processor and chip model, a task model, and a scheduler model.

A. Processor and Chip Model

A multi-core system S with $M$ cores is represented by $S = \{core_1, core_2, \ldots, core_M\}$. We make the following assumptions, which are used throughout the paper.

A1. The cores on a chip are symmetric, all with the same architectural configuration, the same computing capability, and the same access to devices, such as main memory, bus, and I/O.

A2. Based on the fact that most modern microprocessors support different levels of power states [11], each core in the chip/system under consideration is assumed to have at least three power modes: 1) active mode, 2) halt mode, and 3) deep sleep mode. In the active mode, the core operates normally. In the halt mode, a core does not execute any instruction, but the overhead for switching to the active mode is negligible. In the deep sleep mode, a core is completely turned off using the power-gating technique. Before a core is turned off, the state of the core, such as its registers, must be saved to memory, and the core's data cache must be flushed. When a core is turned back on, its previously saved state must be restored. These operations induce large latencies, but they are necessary to guarantee data consistency while turning off a core, migrating jobs to the other cores, and resuming jobs after turning the core back on.

A3. Each core can change its power state independently of others. Using the ARM11 MPCore chip as an example, placeholders for level-shifters and clamps can be inserted around each core so that a separate power domain is implemented for each core [12].

A4. On-chip temperature sensors are deployed on a multi-core chip and are used to measure the temperature at runtime. However, obtaining accurate temperatures is difficult in practice due to the inaccuracy of sensors and their placement [13]. In this paper, we assume that a sufficient number of accurate temperature sensors have been deployed on a chip to minimize the measurement error. Thus, the thermal hotspots can be detected, and the maximum temperature in each core can be acquired via the on-chip temperature sensors.

B. Task Model

The task system under consideration is composed of $N$ independent periodic tasks $T = \{\tau_1, \tau_2, \ldots, \tau_N\}$. Each task $\tau_i$ is modeled as $\tau_i = (e_i, p_i, \vec{l}_i)$, where $e_i$ is the worst-case execution time, $p_i$ is the period of $\tau_i$, and $\vec{l}_i$ is the task's execution profile vector. Each task $\tau_i$ is invoked every $p_i$ time units, and every invocation of a task is called a job. We assume that the relative deadline of a task is equal to its period, meaning that each job must be completed before the next job of the same task is released.

In addition, each task accesses the functional units of a core differently during its execution. Such a task execution profile affects the thermal distribution on a core, so the execution profile of each task $\tau_i$ is modeled as $\vec{l}_i$. Each element in $\vec{l}_i$ represents the average utilization of its corresponding functional unit during the task execution. The size of $\vec{l}_i$ is the number of functional units of a core. To obtain $\vec{l}_i$, a runtime methodology using performance monitoring counters can be applied [14]. Since most embedded real-time processors do not have a sufficient number of performance monitoring counters to support this feature, in this paper we use a modified microprocessor simulator, namely SimpleScalar [15], to obtain the total number of accesses to each functional unit during the task execution, and then normalize this value by the number of CPU cycles needed to complete the task execution. Because the utilization of functional units varies with the task's input, the average of the results over various task inputs is taken as $\vec{l}_i$. For example, the estimated execution profiles of the well-known embedded testbench programs [16] are shown in Fig. 1. Once the task execution profiles are obtained, they are used when the tasks are allocated to cores, or when the assigned workload needs to be re-distributed to other cores to control high temperature at runtime.

Figure 1. Different execution characteristics of embedded testbench programs.
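The normalization and averaging used to build an execution profile can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function name `execution_profile` and the input layout (one `(access_counts, cycles)` pair per simulated task input) are assumptions made here.

```python
def execution_profile(runs):
    """Average per-unit utilization over several simulated task inputs.

    runs: list of (access_counts, cycles) pairs (hypothetical layout), where
    access_counts[i] is the number of accesses to functional unit i and
    cycles is the number of CPU cycles the run took.
    """
    n_units = len(runs[0][0])
    profile = [0.0] * n_units
    for counts, cycles in runs:
        for i, count in enumerate(counts):
            # Normalize the access count by the cycle count of this run.
            profile[i] += count / cycles
    # Average over all task inputs.
    return [value / len(runs) for value in profile]
```

Each returned element plays the role of one entry of the profile vector of a task.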

C. Scheduler Model

In this paper, each core is assumed to have its own local scheduler, and there exists one TAS in S. Before runtime, the tasks in T are distributed to the cores in S so as to meet the schedulability condition on every core. During runtime, the local scheduler and the TAS perform the following operations.

∙ The local scheduler on a core executes the released jobs using the earliest-deadline-first (EDF) scheduling policy. When there is no backlogged workload in its queue, the core enters the halt mode. It returns to the active mode when new jobs are released or when the core is interrupted by the TAS. In addition, the local scheduler keeps track of the completed portion of each running job. This information is used by the TAS to calculate the available time slack on the hot core for turning it off and to evaluate the schedulability conditions for job migrations.

∙ The TAS is invoked every $p_t$ time units and executed on a randomly selected core in S. The value of $p_t$ is chosen at system start time and must be a divisor of $L_\tau$, the least common multiple of all task periods. The TAS monitors the temperature of all cores. If a core's temperature is close to the given threshold, the TAS migrates the jobs running on the hot core to other cooler cores without changing its power mode. If the core temperature is already over the given threshold, the TAS suspends the execution of the hot core and turns off the core during a cooling interval, provided the time slack is sufficient to cool off the core. Otherwise, job-migration is used again to create a sufficient cooling interval.
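The requirement that the TAS period divides the least common multiple of the task periods can be checked with a few lines of Python. This is only a sketch under the assumption that periods are expressed as positive integers (for example, in milliseconds); the helper names `hyperperiod` and `valid_tas_periods` are hypothetical.

```python
from math import lcm  # available in Python 3.9 and later


def hyperperiod(periods):
    """Least common multiple of all (integer) task periods."""
    return lcm(*periods)


def valid_tas_periods(periods):
    """All candidate TAS periods that divide the hyperperiod evenly."""
    big_l = hyperperiod(periods)
    return [d for d in range(1, big_l + 1) if big_l % d == 0]
```

For periods of 20 and 50 time units the hyperperiod is 100, so a TAS period of 10 or 25 would be admissible while 30 would not.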

III. PROBLEM STATEMENT

Our thermal-aware scheduling algorithm is designed with two constraints in mind: (1) meeting all timing constraints of the given task set to ensure real-time performance, and (2) maintaining the maximum temperature on each core below the given temperature threshold throughout the system execution time.

The DTM can be triggered either proactively or reactively. Proactively triggered DTM requires a thermal model of the multi-core chip to predict the thermal behavior of the target core. Estimating an accurate thermal model is impractical due to unknown runtime parameters, and an imprecise thermal model may degrade the performance of DTM. For this reason, we use a simple reactive scheme in the TAS.

First, let $\Theta_i(t)$ denote the maximum temperature of $core_i$ measured at time $t$. The temperature threshold $\Theta_{thres}$ and the threshold margin $\Delta_{thres}$ ($> 0$) are given as a thermal constraint and a design parameter, respectively. To effectively control the temperature on a multi-core chip during runtime, the TAS applies a hybrid thermal management scheme at time $t$ as follows:

∙ Job-migration: When $\Theta_{thres} - \Delta_{thres} \le \Theta_i(t) < \Theta_{thres}$, the core temperature can be reduced by re-distributing the assigned workload on $core_i$ to other cooler cores.

∙ Turn-off with Job-migration: The leakage power is reduced significantly in the deep sleep mode. So, when $\Theta_i(t) \ge \Theta_{thres}$, the core can be cooled down quickly by turning it off for a sufficient time interval. When there is not sufficient slack time for cooling, job migration is applied.

Figure 2. Power mode transition diagram.
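The choice between the two schemes depends only on the measured temperature, the threshold, and the margin, and can be sketched as below. The function name and the returned labels are hypothetical, and the sketch assumes that a reading exactly at the threshold triggers the turn-off scheme, as in the policy definition of Section IV-E.

```python
def select_policy(theta, theta_thres, delta_thres):
    """Pick a thermal management scheme for one core from its temperature."""
    if theta >= theta_thres:
        return "turn_off"   # turn-off with job-migration
    if theta >= theta_thres - delta_thres:
        return "migrate"    # job-migration only
    return "none"           # core is cool enough, no action
```

With the threshold of 85 and margin of 3 used later in the evaluation, a core at 83 degrees would be handled by job-migration and a core at 86 degrees by turn-off.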

Using the proposed thermal management scheme, the power mode of a core changes as shown in Fig. 2. Using the previously defined task, processor, and scheduler models, our thermal-aware scheduling problem can be formally defined as follows.

Given a set of tasks T, a multi-core system S, a temperature threshold $\Theta_{thres}$, and a set of timing constraints $C_T$ expressed in terms of the relative deadlines of independent, periodic tasks, $C_T = \{p_1, p_2, \ldots, p_N\}$, find a runtime algorithm executing every $p_t$ time units in S such that (i) the proposed thermal management scheme is applied to the overheated cores accordingly and (ii) $C_T$ is always guaranteed during these operations.

The proposed algorithm to solve this problem is a runtime algorithm built into the TAS on a given multi-core chip. It is executed with period $p_t$ and applies the proposed thermal management scheme according to the measured core temperatures. The key to meeting the constraints lies in estimating the available slack on the core, turning off the core, and migrating the running jobs to the other cores without violating any timing constraint. In some cases it may not be possible to meet both the thermal and the timing constraints, due to insufficient hardware resources and the tight timing characteristics of the given task set. In such a case, the proposed algorithm treats constraint (i), the thermal one, as soft, but always treats constraint (ii), the timing one, as hard.

IV. THERMAL-AWARE SCHEDULER

Before the system runs, the TAS assigns the periodic tasks to the cores using their execution profiles to produce a uniform temperature distribution across all cores. Since the algorithm is used online, it must be efficient and incur a small runtime overhead. For this purpose, the information needed for the slack estimation and the schedulability checks for job migration is calculated and stored in lookup tables before the system starts. Using this information, we construct the efficient slack estimation, the schedulability conditions for job migration, and the temperature management schemes described in detail below.

A. Task Assignment

To avoid hotspots, the usage of each functional unit of the cores should be evenly distributed. Let $l_j(i)$ denote the $i$-th element of task $\tau_j$'s execution profile vector and $N_l$ denote the total number of functional units on a core under consideration. The sum of utilization of the $i$-th functional unit on $core_k$ is defined as:

$$U_k(i) = \sum_{\tau_j \in core_k} \frac{e_j \cdot l_j(i)}{p_j}. \qquad (1)$$

Using Eq. (1), the utilization variation of the $i$-th functional unit on a multi-core chip is defined as:

$$V(i) = \max_{core_k \in S} U_k(i) - \min_{core_k \in S} U_k(i). \qquad (2)$$

It is well known that real-time task assignment on a multi-core system is NP-hard. Therefore, when the TAS allocates each task to a core, it needs a simple heuristic that minimizes $\sum_{i=1}^{N_l} V(i)$ after the assignment while keeping the sum of the tasks' utilizations on each core less than one.
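The text does not spell out the heuristic, so the following Python sketch shows one possible greedy choice, labeled here as an assumption rather than the authors' method: tasks are taken in order of decreasing utilization, and each is placed on the core that yields the smallest sum of the variations of Eq. (2) while respecting the per-core utilization bound. Tasks are represented as hypothetical `(e, p, profile)` tuples.

```python
def unit_utilization(core_tasks, n_units):
    """Eq. (1): per-unit utilization of one core; tasks are (e, p, profile)."""
    return [sum(e * l[i] / p for e, p, l in core_tasks) for i in range(n_units)]


def total_variation(assignment, n_units):
    """Sum over all functional units of the variation V(i) from Eq. (2)."""
    u = [unit_utilization(core, n_units) for core in assignment]
    return sum(max(c[i] for c in u) - min(c[i] for c in u)
               for i in range(n_units))


def assign_tasks(tasks, n_cores):
    """Greedy sketch: place each task where it keeps unit usage most balanced."""
    n_units = len(tasks[0][2])
    assignment = [[] for _ in range(n_cores)]
    load = [0.0] * n_cores
    for task in sorted(tasks, key=lambda t: t[0] / t[1], reverse=True):
        e, p, _ = task
        best = None
        for k in range(n_cores):
            if load[k] + e / p > 1.0:      # per-core utilization bound
                continue
            assignment[k].append(task)     # tentative placement
            cost = total_variation(assignment, n_units)
            assignment[k].pop()
            if best is None or cost < best[0]:
                best = (cost, k)
        if best is None:
            raise ValueError("task does not fit on any core")
        assignment[best[1]].append(task)
        load[best[1]] += e / p
    return assignment
```

Being greedy, this sketch gives no optimality guarantee; it only illustrates how Eqs. (1) and (2) can drive the placement decision.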

B. Creating Lookup Table

The key components of our thermal management are: (1) an efficient time slack estimation on each core, and (2) the schedulability conditions that account for job migration. After the task assignment, but before runtime, the TAS computes the information necessary for time slack calculation and schedulability checks, and then stores it in a lookup table to reduce the computational overhead at runtime. This step is based on the facts that (i) the tasks are assigned a priori before runtime, (ii) the slack is calculated only every $p_t$ time units by the TAS, and (iii) $p_t$ is a divisor of $L_\tau$. Since the tasks in our system are independent and periodic, together with conditions (ii) and (iii) above, it is sufficient for our algorithm to create a feasible schedule for the interval $[0, L_\tau)$.

For all $0 \le k < L_\tau/p_t$, our algorithm calculates the start time of the first schedule segment on $core_i$ that appears after time $(k+1) \cdot p_t$, denoted by $NB_i(k)$, assuming that the actual execution time of every task is the same as its worst-case execution time at runtime.

The tasks assigned to each core are guaranteed to meet their deadlines when they are allocated to the cores. However, the execution of the tasks assigned to a core can be interrupted and delayed periodically by our temperature management schemes, when the core is turned off and/or when it executes jobs migrated from other cores. So, the schedulability of the tasks on that core needs to be checked again whenever the thermal management is activated. To reduce this runtime overhead, the TAS triggers the thermal management service in such a way that the executions of the jobs released after $NB_i(k)$ are not affected by the thermal management applied before time $k \cdot p_t$. Since it is time-consuming to obtain $NB_i(k)$ at runtime, an iterative method is used to calculate and store these values for runtime usage.

To apply the iterative method, we define $W_i(t)$ as the total remaining workload in the job queue on $core_i$ at time $t$. Initially, $W_i(0) = \sum_{\tau_j \in core_i} e_j$ and $NB_i(L_\tau/p_t - 1) = L_\tau$. The other $NB_i(k)$'s are obtained as follows.

1) Given $W_i(t')$, where $t'$ is the time at which the previous iteration calculated the workload, let $t = \min(t_1, (k+1) \cdot p_t)$ where $t_1 = \min_{\tau_j \in core_i} (\lfloor t'/p_j \rfloor + 1) \cdot p_j$. Then, $W_i(t)$ is calculated from $W_i(t')$ by

$$W_i(t) = \max[W_i(t') - (t - t'), 0] + \sum_{\tau_j \in N_i(t)} e_j$$

where $N_i(t)$ is the set of tasks on $core_i$ whose new jobs are released at time $t$. This procedure is repeated until $t = (k+1) \cdot p_t$.

2) $NB_i(k)$ can be calculated directly if the end of the busy interval at time $(k+1) \cdot p_t$ is known. For this, we define another function $W_i(t, x)$ as the sum of $W_i(t)$ and the total workload generated during the interval $(t, t+x]$. Using Eq. (3), $W_i(t, x)$ is updated iteratively until $W_i(t, x) = x$ and $t + W_i(t, x) < L_\tau$:

$$W_i(t, x) = W_i(t) + \sum_{\tau_j \in core_i} w_j(t, x) \qquad (3)$$

where $w_j(t, x)$ is the total new workload generated by $\tau_j$ during the interval $(t, t+x]$, which is calculated by:

$$w_j(t, x) = \left( \left\lfloor \frac{t+x}{p_j} \right\rfloor - \left\lfloor \frac{t}{p_j} \right\rfloor \right) \cdot e_j.$$

If $x$ is not found, $NB_i(j) = L_\tau$ for all $k \le j < L_\tau/p_t$ and the algorithm is terminated.

3) The $x$ obtained in the previous step is the length of the remaining busy interval at time $(k+1) \cdot p_t$. Then, $NB_i(k)$ is calculated by:

$$NB_i(k) = \min_{\tau_j \in core_i} \left( \left\lfloor \frac{(k+1) \cdot p_t + x}{p_j} \right\rfloor + 1 \right) \cdot p_j.$$

After all $NB_i(k)$ ($0 \le k < L_\tau/p_t$, $1 \le i \le M$) are calculated, they are stored in the lookup tables before runtime. Thus, $O(M \cdot L_\tau/p_t)$ memory space is required for the lookup tables.
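The three steps above can be sketched for a single core as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes integer time units, tasks given as hypothetical `(e, p)` pairs that all release their first job at time 0, and the function names are invented for this sketch. For simplicity it recomputes the backlog from time 0 for every table entry instead of reusing the previous iteration.

```python
from math import lcm  # available in Python 3.9 and later


def backlog_at(tasks, t_end):
    """Step 1: worst-case remaining workload W_i(t_end) on one core."""
    t, w = 0, sum(e for e, _ in tasks)          # W_i(0)
    while t < t_end:
        nxt = min(min((t // p + 1) * p for _, p in tasks), t_end)
        w = max(w - (nxt - t), 0)               # work done since t
        w += sum(e for e, p in tasks if nxt % p == 0)  # releases at nxt
        t = nxt
    return w


def busy_length(tasks, t, w_t, horizon):
    """Step 2: smallest x with W_i(t, x) = x, or None if none before horizon."""
    x = w_t
    while t + x < horizon:
        nxt = w_t + sum(((t + x) // p - t // p) * e for e, p in tasks)
        if nxt == x:
            return x
        x = nxt
    return None


def build_nb_table(tasks, p_t):
    """Step 3: NB_i(k) for k = 0 .. L/p_t - 1 on one core."""
    big_l = lcm(*(p for _, p in tasks))
    if big_l % p_t != 0:
        raise ValueError("p_t must divide the hyperperiod")
    table = []
    for k in range(big_l // p_t):
        t = (k + 1) * p_t
        x = busy_length(tasks, t, backlog_at(tasks, t), big_l) if t < big_l else None
        if x is None:
            table.append(big_l)
        else:
            table.append(min(((t + x) // p + 1) * p for _, p in tasks))
    return table
```

For two tasks with (e, p) = (2, 10) and (3, 20) and a TAS period of 5, the hyperperiod is 20 and the sketch returns one entry per TAS invocation within it.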

C. Estimate Slack on Core

When the TAS is invoked at time $t$ and the measured temperature on $core_i$ is over the temperature threshold, the TAS estimates the available time slack of $core_i$ for powering down using the lookup tables as follows.

1) First, the TAS obtains the currently running jobs from the job queue on $core_i$, as well as the jobs that will be released after time $t$ and executed before $NB_i(k)$ on $core_i$, where $k = t/p_t$. Let $SJ_i(t)$ denote the set of these jobs on $core_i$, sorted in increasing order of their deadlines. For convenience, let $J_j$ denote the job with the $j$-th smallest deadline in $SJ_i(t)$, and let $d_j$, $e_j$, and $r_j$ denote its absolute deadline, worst-case execution time, and release time, respectively.

2) Let $re_j(t)$ denote the worst-case remaining execution time of $J_j \in SJ_i(t)$ at time $t$. If $J_j$ has already been released ($r_j \le t$), $re_j(t)$ is obtained by subtracting the completed portion of $J_j$ from its worst-case execution time $e_j$. Otherwise, when $r_j > t$, $re_j(t) = e_j$. Then, the available slack on $core_i$ at time $t$ is estimated as:

$$S_i(t) = \min_{J_j \in SJ_i(t)} \Big\{ \min(d_j, NB_i(k)) - t - \sum_{J_{j'} \in SJ_i(t),\, d_{j'} \le d_j} re_{j'}(t) \Big\}. \qquad (4)$$

Overall, the slack calculation on one core incurs $O(|SJ_i(t)|)$ runtime computational overhead.
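Eq. (4) can be evaluated in one pass over the deadline-ordered job set by keeping a running sum of remaining execution times. The sketch below is an illustration under assumptions: jobs are hypothetical `(deadline, remaining)` pairs, the function name is invented, and an empty job set is treated as leaving the whole interval up to the lookup-table value free, a case Eq. (4) itself does not define. The sketch sorts its input for safety; with a pre-sorted set the pass itself is linear.

```python
def estimate_slack(jobs, t, nb):
    """Eq. (4): available slack on one core at time t.

    jobs: list of (absolute_deadline, remaining_exec_time) pairs.
    nb:   the lookup-table value NB_i(k) for the current TAS invocation.
    """
    slack = nb - t          # upper bound of every term; also the empty-set case
    demand = 0
    for deadline, remaining in sorted(jobs):
        demand += remaining  # work with deadline no later than this job's
        slack = min(slack, min(deadline, nb) - t - demand)
    return slack
```

For example, with jobs (20, 4) and (30, 5) at time 10 and a table value of 40, the two terms are 6 and 11, so the slack is 6.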

D. Schedulability Conditions for Job Migration

Schedulability conditions are required when the TAS determines which job to migrate and where to continue its execution, so that the migrated job does not miss its deadline and all timing constraints of the tasks assigned to the target core remain guaranteed. Let us assume that the TAS is triggered at time $t$ ($k = t/p_t$), and that it migrates a job $J_s$ running on $core_s$ to $core_t$.

To derive the schedulability conditions for a system with job migration, the latency for migrating a job from one core to another must be considered. First, when $J_s$ is migrated to $core_t$, its context must be saved into memory. When $core_s$ is turned off and enters the deep sleep mode, the registers and D-caches of $core_s$ must also be flushed, which further increases the migration latency. Due to these migration latencies, $J_s$ cannot resume its execution on $core_t$ immediately. Thus, these latencies should be accounted for in the schedulability checks with job migration.

Let $\delta_m$ denote the migration latency. To derive a sufficient condition to feasibly schedule $J_s$ on $core_t$, we first assume that $J_s$ cannot complete its execution before $\min(d_s, NB_t(k))$, which is expressed as:

$$t + \delta_m + \sum_{J_j \in SJ_t(t),\, d_j \le d_s} re_j(t + \delta_m) + re_s(t) > \min(d_s, NB_t(k)). \qquad (5)$$

Since $re_j(t + \delta_m) \le re_j(t)$ in the summation of Eq. (5), Eq. (5) implies

$$t + \delta_m + \sum_{J_j \in SJ_t(t),\, d_j \le d_s} re_j(t) + re_s(t) > \min(d_s, NB_t(k)). \qquad (6)$$

By contraposition, if Eq. (6) does not hold, then Eq. (5) does not hold either, and $J_s$ completes in time. Therefore, the sufficient condition to feasibly schedule the migrated job $J_s$ on $core_t$ is:

$$re_s(t) + \delta_m \le \min(d_s, NB_t(k)) - t - \sum_{J_j \in SJ_t(t),\, d_j \le d_s} re_j(t). \qquad (7)$$

The timing constraints of the tasks assigned to $core_t$ must not be violated as a result of the migration of $J_s$. For every $J_j \in SJ_t(t)$ with $d_j > d_s$, it is sufficient that the slack of $J_j$ is at least $re_s(t) + \delta_m$. This condition is expressed as:

$$re_s(t) + \delta_m \le \min(d_j, NB_t(k)) - t - \sum_{J_i \in SJ_t(t),\, d_i \le d_j} re_i(t). \qquad (8)$$

The sufficient condition for meeting all timing constraints of the tasks assigned to $core_t$ is:

$$re_s(t) + \delta_m \le \min_{J_j \in SJ_t(t),\, d_j > d_s} \Big\{ \min(d_j, NB_t(k)) - t - \sum_{J_i \in SJ_t(t),\, d_i \le d_j} re_i(t) \Big\}. \qquad (9)$$

Together, Eqs. (7) and (9) are sufficient for schedulability with job-migration. Their runtime complexity is $O(|SJ_t(t)|)$.
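Checking Eqs. (7) and (9) for one candidate target core can be sketched as below. The representation is an assumption made for illustration: the jobs of the target core are hypothetical `(deadline, remaining)` pairs, and the function and parameter names are not from the paper.

```python
def can_migrate(target_jobs, t, nb_target, d_s, re_s, delta_m):
    """Return True if Eqs. (7) and (9) both hold for the target core.

    target_jobs: list of (absolute_deadline, remaining_exec_time) on core_t.
    nb_target:   lookup-table value NB_t(k) of the target core.
    d_s, re_s:   deadline and remaining execution time of the migrated job.
    delta_m:     migration latency.
    """
    cost = re_s + delta_m
    # Eq. (7): the migrated job itself must finish in time.
    earlier = sum(re for d, re in target_jobs if d <= d_s)
    if cost > min(d_s, nb_target) - t - earlier:
        return False
    # Eq. (9): every later-deadline job on the target keeps enough slack.
    demand = 0
    for d, re in sorted(target_jobs):
        demand += re
        if d > d_s and cost > min(d, nb_target) - t - demand:
            return False
    return True
```

A TAS implementation would presumably call such a check for each candidate core and migrate to one for which it returns true.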


E. Hybrid Temperature Managements

Using the proposed runtime slack estimation and schedulability check, the temperature management scheme presented in Section III can be applied to the hot cores without violating the timing constraints. First, we express the amount of backlogged workload on the $i$-th functional unit of $core_k$ at time $t$ as follows:

$$W_k^i(t) = \sum_{J_j \in core_k,\, r_j \le t,\, re_j(t) \ne 0} re_j(t) \cdot l_j(i).$$

The backlogged workload on $core_k$ at time $t$ is defined as the following vector:

$$\vec{W}_k(t) = (W_k^1(t), W_k^2(t), \cdots, W_k^{N_l}(t)) \qquad (10)$$

where $N_l$ is the total number of functional units on a core. Because the job-migration latency and the latency to turn a core on and off are not negligible, we define several microprocessor-dependent timing parameters as follows.

∙ $t_{m1}$: the worst-case time to save a task's context into memory.

∙ $t_{m2}$: the worst-case time to flush the D-cache and save the state of a core into memory.

∙ $t_{on}$: the worst-case time to wake up a core and reload the saved state of the core.

We define C as the set of cores satisfying the following two conditions.

1) $\Theta_k(t) < \Theta_{thres} - \Delta_{thres}$.

2) $W_k^j(t) \le W_{ave}^j(t)$ for all $1 \le j \le N_l$, where $\vec{W}_{ave}$ is the average backlogged workload of all cores in S at time $t$.
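The two membership conditions for C translate directly into a filter over the cores. In the following sketch the data layout is an assumption: `temps[k]` is the measured temperature of core k and `backlogs[k]` is its backlogged workload vector as a plain list; the function names are hypothetical.

```python
def backlog_vector(jobs, n_units):
    """Backlogged workload per functional unit of one core.

    jobs: list of (remaining_exec_time, profile) for released, unfinished jobs.
    """
    return [sum(re * l[i] for re, l in jobs) for i in range(n_units)]


def candidate_cores(temps, backlogs, theta_thres, delta_thres):
    """Indices of cores that are cool enough and no busier than average."""
    n_cores = len(backlogs)
    n_units = len(backlogs[0])
    avg = [sum(b[i] for b in backlogs) / n_cores for i in range(n_units)]
    return [k for k in range(n_cores)
            if temps[k] < theta_thres - delta_thres
            and all(backlogs[k][i] <= avg[i] for i in range(n_units))]
```

Only cores returned by such a filter would be considered as migration targets.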

So, when the TAS is invoked at time $t$, if $\Theta_{thres} - \Delta_{thres} \le \Theta_k(t) < \Theta_{thres}$, the TAS migrates jobs running on $core_k$ to any core in C satisfying Eqs. (7) and (9) with $\delta_m = t_{m1}$. The job migration continues until

1) the job queue on $core_k$ is empty, or

2) there is no available core in C, or

3) $W_k^j(t) \le W_{ave}^j(t)$ for all $1 \le j \le N_l$.

The above temperature management scheme is called DTM policy 1. Likewise, if $\Theta_k(t) \ge \Theta_{thres}$, the TAS migrates jobs running on $core_k$ to any core in C satisfying Eqs. (7) and (9) with $\delta_m = t_{m1} + t_{m2}$. The job migration continues until

1) the job queue on $core_k$ is empty, or

2) there is no available core in C, or

3) $S_k(t) \ge p_t$.

Then, $core_k$ can be turned off for a duration of $\min(p_t, S_k(t)) - T_{lsum}$, but only if $S_k(t) > T_{lsum}$, where $T_{lsum} = t_{m1} + t_{m2} + t_{on}$. If $core_k$ is turned off and there exist jobs of the tasks assigned to $core_k$ that are supposed to be released during the turn-off interval, their release times are postponed until the core is turned on again after cooling. This temperature management scheme is called DTM policy 2.
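The length of the turn-off interval in DTM policy 2 is simple arithmetic on the slack, the TAS period, and the three latency parameters. A minimal sketch, with a hypothetical function name and the assumption that a return value of zero means the core stays on:

```python
def turn_off_interval(slack, p_t, t_m1, t_m2, t_on):
    """Time a hot core can stay powered off within one TAS period."""
    t_lsum = t_m1 + t_m2 + t_on      # total save, flush, and wake-up latency
    if slack <= t_lsum:
        return 0.0                   # not enough slack to power-gate at all
    return min(p_t, slack) - t_lsum
```

With the latencies of 0.5, 2.0, and 0.5 msec used in the evaluation, the total latency is 3.0 msec, so a core with 10 msec of slack under a 20 msec TAS period could be turned off for 7 msec.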

V. PERFORMANCE EVALUATION

The evaluation of our TAS has been conducted using a well-known thermal simulator, HotSpot [17], with a given task set and timing constraints. The input models, experiment setup, and simulation results are discussed in this section.

A. Simulation Setup

The multi-core chip used in our evaluation was an 8-core chip, as shown in Fig. 3. Each core was set to operate at 800 MHz with a Vdd of 1.2 V. The Alpha 21264 floorplan was used as the core layout, with its size scaled down to reflect 45 nm CMOS process technology. The physical size of each core was set to 12.6853 mm^2.

core 1 core 2 core 3 core 4

core 5 core 6 core 7 core 8

Figure 3. Layout of 8 cores on a multi-core chip.

To create the task set T, each task was generated with its period uniformly distributed in [20, 200] msec and its utilization uniformly distributed in (0, 0.8]. The tasks were generated one by one until the total system utilization reached $0.8 \cdot M$, where $M$ is the total number of cores. In the simulation, the actual execution time of each job was set such that the ratio of its actual execution time to its worst-case execution time is uniformly distributed in (0.2, 1.0].
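A task-set generator following this description might look like the sketch below. Several details are assumptions not stated in the text: periods are drawn as real numbers, the last generated task is kept even if it pushes the total utilization past the target, and the function names and the `(wcet, period)` tuple layout are hypothetical.

```python
import random


def generate_task_set(n_cores, seed=None):
    """Draw (wcet, period) pairs until total utilization reaches 0.8 * n_cores."""
    rng = random.Random(seed)
    target = 0.8 * n_cores
    tasks, total = [], 0.0
    while total < target:
        period = rng.uniform(20.0, 200.0)        # msec
        util = 0.8 * (1.0 - rng.random())        # uniform in (0, 0.8]
        tasks.append((util * period, period))
        total += util
    return tasks


def actual_execution_time(wcet, rng):
    """Scale the worst case by a ratio drawn uniformly from (0.2, 1.0]."""
    return wcet * (1.0 - 0.8 * rng.random())
```

If the lookup tables of Section IV-B are to be built for such a task set, integer periods would be needed so that the least common multiple is well defined.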

After setting the parameters of all tasks, we randomly chose one of the testbench programs shown in Fig. 1 for each task and assigned its dynamic power characteristics to that task. The dynamic power dissipation of the testbench programs was obtained using the Wattch power simulator [18]. We also used the McPAT power simulator [19] to model the leakage power consumed in the experiments.

The convection resistance of the heat sink of the multi-core chip should be set to a realistic value in our experiment. Based on the knowledge of existing systems, the convection resistance is typically around 0.5 to 2 K/W for the heat sinks used in desktop processors, and around 4 to 8 K/W for mobile devices [20]. We therefore chose 3.5 K/W as the convection resistance of the heat sink model in our experiments. Also, the ambient temperature was set to 45 °C, while $\Theta_{thres}$ = 85 °C and $\Delta_{thres}$ = 3.0 °C were used in the simulations. For the schedulability checks for job migrations in the simulation, $t_{m1}$, $t_{m2}$, and $t_{on}$ were set to 0.5, 2.0, and 0.5 msec, respectively.

B. Observed Chip Temperature Behavior

We iteratively ran the randomly generated tasks with the multi-core chip model to produce the power traces and obtained the temperature traces of the cores from the HotSpot thermal simulator. The metric used for the evaluation was a trace of the maximum core temperature on the given multi-core chip. For all the parameters we considered, the temperatures of all cores were sampled every 10 msec for five minutes of simulation time after the core temperatures stabilized.

Fig. 4(a) shows the performance results with different values of $p_t$. As can be seen, when the frequency of monitoring core temperatures increases, the performance of the TAS increases as well. However, the runtime overhead also increases, because of the larger number of TAS interruptions and the larger lookup table. Overall, the percentage of core temperatures lower than the temperature threshold was 94%, 86%, 81%, and 80% when $p_t$ = 10, 20, 30, and 40 msec, respectively.

Fig. 4(b) shows the performance results with $p_t$ = 20 msec for "DTM policy 1 only" or "DTM policy 2 only" when the core temperature is above the temperature threshold. The percentage of core temperatures lower than the threshold was 84% and 81%, respectively. This experiment showed that the performance gap between the hybrid scheme and "DTM policy 2 only" is small in terms of the percentage of maximum core temperatures under the temperature threshold. In practice, however, the number of turn-off and wake-up operations should also be minimized, because extra dynamic power is dissipated, and noise may be induced, when turning a core on and off. During the system execution time, the number of turn-off and wake-up operations triggered by "DTM policy 2 only" was about 18.7% higher than the number triggered by the hybrid scheme.

In addition, measurement errors can occur. The errors are mainly caused by the inaccuracy and poor placement of the temperature sensors. To see the performance of the TAS with measurement errors, Gaussian random noise with a variance of 1 °C and 4 °C was added to the measurements. The performance results with $p_t$ = 20 msec are shown in Fig. 4(c). Because of the incorrectly measured temperatures, the TAS frequently forced unnecessary shutdowns of cool cores, especially with the large measurement error variance (the noise with 4 °C variance). The number of turn-off and wake-up operations was about 15.8% higher than with accurate measurements, and yet the performance was degraded by 8.1%.

VI. RELATED WORK

A large body of research has been done on DTM usingpower-gating techniques. Most of them do not address thetiming constraints. Jiang et al. [10] studied the benefits, andthe costs of power-gating techniques in terms of power, chiparea and the amount of saved leakage power. Choi et al.[21] also studied power-gating techniques with the mainfocus being design of circuit-level synthesis technique tominimize the size of data retention storage to reduce theruntime overhead of power-gating.

For scheduling with power-gating, Xu et al. [22] de-veloped an online noise-aware scheduling algorithm withpower-gating techniques on a multi-core chip. Their workfocused on reducing the noise induced by turning on andoff a processor unit. Unlike previous work, Yuan et al. [23]proposed a thermal-aware scheduler using power-gating toguarantee the timing constraint of a single task executingon an uniprocessor. Applying it to multi-core systems withmultiple hard real-time tasks is difficult. Likewise, a hybridscheduling algorithm with DVFS and power-gating has beenproposed by Kang et al. [24], but subjects to the samelimitation.

For job migration, Donald and Martonosi [7] analyzed various DTM schemes, such as the stop-n-go scheme and process migration, along with distributed or global control policies in a multi-core system. Powell et al. [25] developed a hardware approach to use thread migration in multi-threaded (SMT) multi-core chips. None of these approaches addresses real-time timing constraints.

VII. CONCLUSIONS

In this paper, we presented a simple, efficient runtime TAS that uses job migration and power-gating mechanisms together to meet both thermal (soft) and timing (hard) constraints. The TAS is based on a reactive mechanism that periodically monitors the current core temperatures using on-chip temperature sensors and determines when to re-distribute the workload on the hot cores or to turn off the hot cores to meet the thermal constraints. To avoid violating the application timing constraints, an online schedulability condition for job migrations is employed in the TAS algorithm.
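The reactive decision summarized above can be sketched as a two-level threshold rule. This is only an illustration under stated assumptions: the two thresholds, the fallback of taking no action when migration is infeasible, and the utilization-bound check standing in for the online schedulability condition are all hypothetical and are not the TAS algorithm itself.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    MIGRATE = "migrate"        # move jobs from a hot core to a cooler core
    POWER_GATE = "power_gate"  # turn the hot core off for a cooling period


def fits_on_core(target_utilization: float, job_utilization: float,
                 bound: float = 1.0) -> bool:
    """Placeholder schedulability check (assumption): the target core's
    utilization must stay within a bound after accepting the job."""
    return target_utilization + job_utilization <= bound


def choose_action(temp_c: float, migrate_threshold_c: float,
                  gate_threshold_c: float, can_migrate: bool) -> Action:
    """Pick a thermal-management action from one periodic temperature sample."""
    if migrate_threshold_c > gate_threshold_c:
        raise ValueError("migrate threshold must not exceed gate threshold")
    if temp_c >= gate_threshold_c:
        return Action.POWER_GATE
    if temp_c >= migrate_threshold_c and can_migrate:
        return Action.MIGRATE
    return Action.NONE  # assumed fallback when no safe action applies


# Hypothetical sample: a core at 82 deg C with thresholds of 80 and 85 deg C.
ok = fits_on_core(target_utilization=0.5, job_utilization=0.2)
print(choose_action(82.0, 80.0, 85.0, can_migrate=ok))
```

Keeping the schedulability check separate from the threshold rule mirrors the idea that a migration is triggered only when the timing constraints on the destination core can still be met.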

We have further evaluated the effectiveness of our solution using a simulation framework that combines several power/thermal simulators, including HotSpot, Wattch and McPAT, on an 8-core chip model with 45 nm technology. Our evaluation results have shown that the proposed solution can keep the maximum temperature of a given multi-core chip around the given threshold for a set of randomly generated tasks with different power and timing characteristics.

Given the encouraging results of our thermal-aware scheduler with the job migration and power-gating scheme, we plan to focus on extensive evaluation of more complex cases with various hybrid DTM mechanisms.

REFERENCES

[1] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The impact of technology scaling on lifetime reliability,” in IEEE Computer Society, 2004.

[2] S. Gunther, F. Binns, D. M. Carmean, and J. C. Hall, “Managing the Impact of Increasing Microprocessor Power Consumption,” Intel Technology Journal, vol. 5, no. 1, 2001.

[3] W.-L. Hung, Y. Xie, N. Vijaykrishnan, C. Addo-Quaye, T. Theocharides, and M. Irwin, “Thermal-aware floorplanning using genetic algorithms,” in ISQED, 2005.


Figure 4. The cumulative distribution function of measured maximum temperatures on the 8-core chip model: (a) with different 𝑝𝑡; (b) with different DTM schemes; (c) with measurement errors.

[4] S. Wang and R. Bettati, “Reactive speed control in temperature-constrained real-time systems,” in Real-Time Systems, 18th Euromicro Conference on, 2006.

[5] P. Pillai and K. G. Shin, “Real-time dynamic voltage scaling for low-power embedded operating systems,” in Proc. of the 18th ACM symposium on Operating systems principles, ser. SOSP ’01, 2001, pp. 89–102.

[6] S. K. Barry Pangrle, “Leakage power at 90nm and below,” EE Times Asia, 2005.

[7] J. Donald and M. Martonosi, “Techniques for multicore thermal management: Classification and new exploration,” SIGARCH Comput. Archit. News, vol. 34, 2006.

[8] B. Calhoun, F. Honore, and A. Chandrakasan, “A leakage reduction methodology for distributed mtcmos,” Solid-State Circuits, IEEE Journal of, vol. 39, no. 5, pp. 818–826, May 2004.

[9] J. Kao, S. Narendra, and A. Chandrakasan, “Mtcmos hierarchical sizing based on mutual exclusive discharge patterns,” in Design Automation Conference, Jun. 1998, pp. 495–500.

[10] H. Jiang, M. Marek-Sadowska, and S. R. Nassif, “Benefits and costs of power-gating technique,” in Proc. of the 2005 International Conference on Computer Design. IEEE Computer Society, 2005, pp. 559–566.

[11] ACPI, “Advanced Configuration Power Interface (http://www.acpi.info).”

[12] ARM Corp., “ARM11 MPCore (http://www.arm.com).”

[13] S. Sharifi and T. Rosing, “Accurate direct and indirect on-chip temperature sensing for efficient dynamic thermal management,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans. on, vol. 29, no. 10, 2010.

[14] C. Isci and M. Martonosi, “Runtime power monitoring in high-end processors: Methodology and empirical data,” in International Symposium on Microarchitecture, 2003.

[15] T. Austin, E. Larson, and D. Ernst, “Simplescalar: an infrastructure for computer system modeling,” Computer, vol. 35, no. 2, 2002.

[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “Mibench: A free, commercially representative embedded benchmark suite,” in Proceedings of the Workload Characterization, 2001 IEEE International Workshop. IEEE Computer Society, 2001.

[17] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. Stan, “Hotspot: a compact thermal modeling methodology for early-stage vlsi design,” VLSI Systems, IEEE Trans. on, vol. 14, no. 5, 2006.

[18] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A framework for architectural-level power analysis and optimizations,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.

[19] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “Mcpat: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Microarchitecture, 2009. 42nd Annual IEEE/ACM International Symposium on, 2009.

[20] R. Viswanath, V. Wakharkar, A. Watwe, V. Lebonheur, M. Group, and I. Corp, “Thermal performance challenges from silicon to systems,” 2000.

[21] E. Choi, C. Shin, T. Kim, and Y. Shin, “Power-gating-aware high-level synthesis,” in ISLPED, 2008 ACM/IEEE International Symposium on, 2008.

[22] Y. Xu, W. Liu, Y. Wang, J. Xu, X. Chen, and H. Yang, “On-line mpsoc scheduling considering power gating induced power/ground noise,” in Proc. of the 2009 IEEE Computer Society Annual Symposium on VLSI. IEEE Computer Society, 2009.

[23] L. Yuan, S. Leventhal, and G. Qu, “Temperature-aware leakage minimization technique for real-time systems,” in Proc. of the 2006 IEEE/ACM international conf. on Computer-aided design, ser. ICCAD. ACM, 2006.

[24] K. Kang, J. Kim, S. Yoo, and C.-M. Kyung, “Temperature-aware integrated dvfs and power gating for executing tasks with runtime distribution,” Trans. Comp.-Aided Des. Integ. Cir. Sys., September 2010.

[25] M. D. Powell, M. Gomaa, and T. N. Vijaykumar, “Heat-and-run: leveraging smt and cmp to manage power density through the operating system,” in Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
