


Real-Time Systems, 15, 149–181 (1998). © 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Fault-Tolerant Rate-Monotonic Scheduling

SUNONDO GHOSH [email protected]
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

RAMI MELHEM [email protected]
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

DANIEL MOSSÉ [email protected]
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

JOYDEEP SEN SARMA [email protected]
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

Abstract. Due to the critical nature of the tasks in hard real-time systems, it is essential that faults be tolerated. In this paper, we present a scheme which can be used to tolerate faults during the execution of preemptive real-time tasks. We describe a recovery scheme which can be used to re-execute tasks in the event of single and multiple transient faults and discuss conditions that must be met by any such recovery scheme. We then extend the original Rate Monotonic Scheduling (RMS) scheme and the exact characterization of RMS to provide tolerance for single and multiple transient faults. We derive schedulability bounds for sets of real-time tasks given the desired level of fault tolerance for each task or subset of tasks. Finally, we analyze and compare those bounds with existing bounds for non-fault-tolerant and other variations of RMS.

1. Introduction

Recent advances in technology have brought much attention to migrating previously mechanical or manual systems to embedded computing systems. Among these, systems with timing constraints (called real-time systems) have been widely used in several time critical applications ranging from fly-by-wire aircraft and space shuttles, to industrial process control and smart automobiles (Baldwin, 1995; Doyle and Elzey, 1994; Hecht et al., 1994; Kopetz, 1995; Lachenmaier and Stretch, 1994; Tindell, 1994). The tasks in such applications are hard real-time tasks, which have stringent timing constraints. The consequences of missing the deadline of a hard real-time task may be catastrophic.

Task deadlines in hard real-time systems must be met even in the presence of faults due to their critical nature. Faults to be tolerated can be of three kinds: permanent, intermittent, and transient (Johnson, 1989). Permanent faults are caused by the total failure of a computing unit, and are typically tolerated by using hardware redundancy, such as spare processors. Transient faults are temporary malfunctions of the computing unit or any other associated components which cause an incorrect result to be computed. Intermittent faults are repeated occurrences of transient faults.

Transient faults can be caused by limitations in the accuracy of electromechanical devices, electromagnetic radiation received by interconnections (e.g., long buses acting like receiving antennas), power fluctuations not properly filtered by the power supply, and effects of ionizing radiation on semiconductor devices (Castillo, McConnel and Siewiorek, 1982).


Several studies have shown that transient faults are significantly more frequent than permanent faults (Castillo, McConnel and Siewiorek, 1982; Iyer, Rossetti and Hsueh, 1986; Siewiorek, 1978). In (Siewiorek, 1978), measurements showed that transient faults are 30 times more frequent than permanent faults, while in (Iyer, Rossetti and Hsueh, 1986), 83% of all faults were determined to be transient or intermittent. In Castillo, McConnel and Siewiorek (1982), tables were provided to show that rates of transient faults are about 20 times that of permanent faults.

In some real-time systems such as satellites and space shuttles, transient faults occur at a much higher frequency than in general purpose systems (Campbell, McDonald and Ray, 1992). An orbiting satellite (Campbell, McDonald and Ray, 1992) containing a microelectronics test system has been used to measure error rates in various semiconductor devices including microprocessor systems. The number of errors, caused by protons and cosmic ray ions, mostly ranged between 1 and 15 in 15-minute intervals, and was measured to be as high as 35 in such intervals.

Due to the high occurrence of transient faults in some applications, and because physical redundancy can be used to tolerate permanent faults in real-time systems (Mossé, Melhem and Ghosh, 1994; Oh, 1994), we focus on the problem of transient and intermittent faults in this paper. A scheme is presented to add time redundancy to a schedule of preemptive, periodic real-time tasks such that faults can be tolerated during their execution. The computation of the slack needed (addition of time redundancy to the schedule) and the development of new fault-tolerant bounds for RMS are independent and orthogonal issues. Moreover, the scheme can be used for both static and dynamic priority tasks.

In the rest of this section we describe representative related work, and introduce the system, task and fault models used in this paper. In Section 2, an approach is presented to incorporate time redundancy into a schedule of preemptive, periodic tasks such that guarantees can be provided for tasks to re-execute. We compute the amount of slack and describe a recovery scheme, which can be used to re-execute tasks in the event of single and multiple transient faults. Conditions that must be met by any such recovery scheme are discussed. In Section 3, the rate monotonic scheduling (RMS) schedulability bounds developed in Liu and Layland (1973) are transformed into fault-tolerant bounds, and the exact analysis of RMS presented in Lehoczky, Sha and Ding (1989) is extended to include fault tolerance capabilities. In Sections 4 and 5, the RMS bounds are further improved by making assumptions about task utilizations. The evaluation of the fault-tolerant RMS scheme and a comparison with other schemes is presented in Section 6. Section 7 shows simulation results validating our approach. Finally, the paper is concluded in Section 8.

1.1. Related Work

To avoid the catastrophic and costly consequences of system failure in critical real-time systems, it is essential that faults be detected and tolerated. Extensive research has been conducted on detecting faults using both hardware (Gaisler, 1994; Mahmood and McCluskey, 1988; Miremadi and Torin, 1995; Schuette and Shen, 1987) and software (Kane and Yau, 1975; Randell, 1975; Yau and Chen, 1980). In Miremadi and Torin (1995), three error detection mechanisms are described which can be used to tolerate transient faults caused by heavy-ion radiation and power-supply disturbances.


A survey of various types of watchdog processors for concurrent error detection is presented in Mahmood and McCluskey (1988). In Gaisler (1994), a processing core used for embedded space flight applications is provided with internal concurrent error-detection, mainly to detect transient faults. In Schuette and Shen (1987), an approach is presented for the on-line detection of control flow errors caused by transient and intermittent faults using special built-in hardware. The concept of recovery blocks is introduced in Randell (1975), and techniques to detect control flow errors of programs are presented in Kane and Yau (1975) and Yau and Chen (1980).

Transient faults in real-time systems are generally tolerated using time redundancy, which involves the retry or re-execution of any task running during the occurrence of a transient fault (Ghosh, Melhem, and Mossé, 1995; Kopetz et al., 1990; Krishna and Singh, 1993; Ramos-Thuel, 1993). This is a relatively inexpensive method of providing fault tolerance since it does not require additional hardware. In applications such as space and aviation, reducing the hardware decreases weight, size, power consumption, and cost.

To guarantee that real-time tasks will meet their deadlines, scheduling techniques can be used to provide predictable response time. A fundamental result in real-time scheduling was developed by Liu and Layland (1973), in which periodic preemptive tasks were studied. It is proved that, if tasks have fixed priorities that are derived from their periods, then any set of n tasks with a total utilization below n(2^{1/n} − 1) is schedulable on a uniprocessor system. This policy was called Rate Monotonic Scheduling (RMS). In recent years, RMS has been used to schedule real-time task sets in a variety of critical applications ranging from navigation of satellites (Doyle and Elzey, 1994) to avionics (Hecht et al., 1994; Lachenmaier and Stretch, 1994) to industrial process control (Baldwin, 1995). However, RMS does not provide mechanisms for managing time redundancy to allow real-time tasks to complete within their deadlines even in the presence of faults. One of the goals of this paper is to derive new utilization bounds which include provisions for the re-execution of faulty tasks scheduled using RMS.

We are aware of a few previous attempts to incorporate time redundancy into rate monotonic scheduling (Burns, Davis and Punnekkat, 1996; Pandya and Malek, 1994; Ramos-Thuel and Strosnider, 1995).¹ Pandya and Malek (1994) analyze the schedulability of a set of periodic tasks that are scheduled using RMS and tolerate a single fault. In the event of a fault, all unfinished tasks are re-executed. It is proved that no task will miss a single deadline under these conditions even in the presence of a fault if the utilization of the processor does not exceed 50%. The types of faults handled are those that can be handled by re-execution, e.g., intermittent and transient hardware faults. Ramos-Thuel (1993) and Ramos-Thuel and Strosnider (1995) present static and dynamic allocation strategies to provide fault tolerance. Two algorithms are proposed to reserve time for the recovery of periodic real-time tasks on a uniprocessor (Ramos-Thuel and Strosnider, 1995). The RMS analysis of Lehoczky, Sha and Ding (1989) has been extended to include provisions for task re-executions. Another study done recently by Burns, Davis and Punnekkat (1996) provides exact schedulability tests for fault-tolerant task sets. Time redundancy is employed to provide predictable performance in the presence of failures. The response time analysis of Audsley et al. (1993) is extended to the fault tolerance approach. However, neither of the studies in Burns, Davis and Punnekkat (1996) or Ramos-Thuel and Strosnider (1995) derives utilization bounds² for the task sets, as is done in this paper.


Another method for tolerating faults is to use the schemes developed for scheduling aperiodic tasks in periodic systems (Sha, Sprunt and Lehoczky, 1989; Lehoczky, Sha and Strosnider, 1987), by considering the re-executing task to be an aperiodic task which arrives when the task fails. However, there are some problems with this approach. First, the papers dealing with aperiodic tasks generally try to improve their response time, without taking into consideration the deadlines of the tasks. For fault tolerance, the deadline of the re-executing task is important. Second, no recovery scheme using aperiodic servers has been demonstrated. Third, the bounds developed for scheduling aperiodic tasks have lower utilization than the bounds developed in this paper. In Section 6.3 we thoroughly compare the performance of the aperiodic servers with our scheme.

1.2. System, Task and Fault Models

The usual RMS/EDF assumptions are adopted for the task model. Sets of independent, periodic, preemptive tasks are considered. A task is eligible for execution at the beginning of its period and has to complete before its deadline (the end of its period). There are n tasks in the system, denoted by τ1, τ2, . . . , τn. The computation time and period of task τi are denoted by Ci and Ti respectively, and Ti ≤ Ti+1, i = 1, . . . , n − 1. The utilization Ui of τi is Ci/Ti, and the total utilization (same as the utilization factor defined in Liu and Layland (1973)) of the system is the fraction of processor time spent in the execution of the task set and is equal to ∑_{i=1}^{n} Ui.

We assume that faults are transient or intermittent, that is, only a single task is affected by each fault. Thus, recovery from faults can be carried out by re-execution. We assume that either the hardware or the application can detect faults. The costs of fault detection and preemption are assumed to be negligible since they can be incorporated in the task's Ci. Because faults need to be identified before being tolerated, error detection is essential. Therefore, our approach will make use of well-known error detection mechanisms such as fail-signal processors, alarms or watchdogs, signatures, and acceptance tests (ATs) (Campbell, McDonald and Ray, 1992; Johnson, 1989; Oh and MacEwen, 1992; Pradhan, 1986).

We consider several fault models in this paper. First, we present a scheme for tolerating any single fault at any time. The scheme is then extended to tolerate multiple faults by increasing the amount of redundancy. We specifically consider the case where any two faults are separated by an interval of Tn + Tn−1 (Section 4) and the case where multiple faults can occur within Tn + Tn−1, in different tasks or during the execution and re-execution of the same task (Section 5).

2. Adding Time Redundancy

Our general approach to fault tolerance is to make sure that there is enough slack in the schedule to allow for the re-execution of any task instance, if a fault occurs during its execution.


Figure 1. Two tasks and the backup tasks: C1 = 1.5 (T1 = 5) and C2 = 2 (T2 = 8), with backups B1 = 1.5, B2 = 0.9, B3 = 0.6, B4 = 1.5, B5 = 0.3 over the interval [0, 16].

The insertion of slack and the recovery mechanism is called the FT scheme in this paper. If no slack is added, then the corresponding scheme is referred to as the No-FT scheme. The recovery mechanism ensures that the slack reserved in the schedule can be used for the re-execution of a task before its deadline, without causing other tasks to miss their deadlines.

The added slack is distributed throughout the schedule such that the amount of slack available over an interval of time is proportional to the length of that interval. The ratio of slack available over an interval of time is thus constant and can be regarded as the utilization of a backup task B. Formally, if the backup utilization is UB, and the backup time (or slack) available during an interval L is denoted by BL, then

B_L = U_B · L    (1)

The backup task can be envisioned as occupying a backup slot between every two consecutive period boundaries, where a period boundary is the beginning of any period. Therefore, the length of a backup slot between the kth period of τi and the lth period of τj is given by UB(lTj − kTi), where there is no period boundary of any task in the system between times kTi and lTj. It is important to note that the backup task is just an abstraction to help in reasoning about the existing (or inserted) slack in the processor. A schedule (or "execution times") resulting from the insertion of backups and the other RMS tasks is called IBRMS (for Inserted-Backup RMS).

In the absence of faults, the existing or inserted slack in IBRMS is not used and the tasks are scheduled following the usual RMS scheme. Another way of reasoning about the backup task is to implement it as a task with a period TB = gcd{Ti} and CB = TB · UB = TB × max{Ui}. Since this latter representation is more cumbersome, we will adopt the former, with intervals being delimited by task periods.

As an example of an IBRMS schedule, consider two tasks with C1 = 1.5, T1 = 5, C2 = 2, and T2 = 8 (see Figure 1). The two utilizations are 30% and 25% respectively. Assuming that UB = 30%, the length of the backup from 0 to T1 is 1.5, from T1 to T2 it is 0.9, from T2 to 2T1 it is 0.6, and so on. The utilization factor of the system is 55%. Note that the reserved slack is shown as shaded boxes in the figure at the beginning of each period.
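To make Eq. (1) concrete, the following minimal Python sketch (our illustration, not part of the paper) enumerates the backup slots between consecutive period boundaries; the helper name backup_slots and the choice of horizon are our own.

# Backup slot lengths per Eq. (1): the slack over an interval of length L is U_B * L.
def backup_slots(periods, u_b, horizon):
    # all period boundaries (beginnings of periods) up to the horizon
    bounds = sorted({k * t for t in periods for k in range(horizon // t + 1)})
    return [(a, b, u_b * (b - a)) for a, b in zip(bounds, bounds[1:])]

# The two-task example of Figure 1: T1 = 5, T2 = 8, UB = 30%
for start, end, slack in backup_slots([5, 8], 0.30, 16):
    print(f"[{start:2d},{end:2d}] backup = {slack:.2f}")
# prints 1.5 for [0,5], 0.9 for [5,8], 0.6 for [8,10], 1.5 for [10,15], 0.3 for [15,16]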

In our FT scheme we use overloading³ of the backups: the same reserved time is used as the backup for all the tasks in the system. If a separate backup were to be scheduled for each task, then the utilization of each task would double, which would reduce the maximum schedulability to half of the initial value (≈ 34.5% in the general case, which assumes that a task set is schedulable if its utilization is less than ≈ 69% (Liu and Layland, 1973)).

2.1. Conditions for Recovery from a Single Fault

In this section, conditions are stated to ensure recovery from single faults in a preemptive, periodic, uniprocessor system.

First assume that no faults occur in the system. It is shown later that the utilization bound derived using the FT scheme formulas is always smaller than the corresponding bound of Liu and Layland (1973). Therefore, if the task set satisfies a bound with the FT scheme, it will also satisfy the original RMS bound, and is thus schedulable.

Now let us consider the schedulability of tasks in the event of a failure and subsequent re-execution of any one task. A schedulability bound for such a scenario can be derived by considering the faulty task as one with twice the utilization of the original task. Due to overloading, all single faults can be tolerated if the utilization bound of the system is decreased by a backup utilization equal to the maximum of all task utilizations (UB = max{Ui}). Specifically, if the Liu and Layland bound (for RMS) is denoted by ULL, then a corresponding naive bound that takes backup utilizations into account can be determined as:

U_{naive} = U_{LL} − U_B    (2)

For large values of n, the RMS bound ULL ≈ 69%. Thus, if UB = 25% (for example), then according to the above formula, the utilization bound for any task set with large n can be 44%. This is obviously not a tight bound, since Pandya and Malek (1994) have shown the bound to be 50%. Below we develop a recovery (re-execution) scheme for the faulty tasks and the corresponding tight conditions for task schedulability given a set of tasks and the utilization of the backup.

The following conditions must hold true to ensure re-execution of any one faulty task:

[S1]: For every task τi, a slack of at least Ci should be present between kTi and (k + 1)Ti.

[S2]: If there is a fault during the execution of task τr then the recovery scheme should enable task τr to re-execute for a duration Cr before its deadline.

[S3]: When a task re-executes, it should not cause any other task to miss its deadline.

To satisfy [S1] we distribute slack throughout the schedule in the form of backup slots at each period boundary (IBRMS). It will be shown in Section 2.3 that if the backup utilization is equal to the maximum of the task utilizations (UB = max{Ui}) then [S1] is satisfied. In Section 4 we derive conditions for the schedulability of a task set for a given backup utilization, when IBRMS is used for scheduling. Henceforth we assume that whenever a task set is said to satisfy [S1], it satisfies the schedulability test in Section 4. Note that if a task set satisfies [S1], then no instance of any task in this task set will miss its deadline in the IBRMS schedule.


Queue ReadyQ;   /* Queue of ready tasks */
Queue BlockedQ; /* Queue of delayed tasks */

/* System is in recovery mode and τr is recovering */

Event: a new instance of task τj arrives
    if (Dj > Dr) then
        add τj to BlockedQ;
    else /* as in RMS */
        add τj to ReadyQ;
        execute as in rate monotonic scheduling
    endif

Event: an instance of task τj terminates
    if (τj is the recovering task) then /* recovery complete */
        move all tasks from BlockedQ to ReadyQ;
    endif
    execute as in rate monotonic scheduling

Figure 2. Recovery algorithm RS.

As mentioned earlier, the slack is not used until there is a fault. The re-execution mechanism described next allows us to use the reserved slack for re-execution after a failure. If the task set satisfies [S1] then this recovery scheme ensures that both [S2] and [S3] are satisfied.

2.2. Recovery Scheme

When a fault is detected at the end of the execution of some task τr, the system goes into the recovery mode. In this mode, τr will re-execute at its own priority, except for the following case. During recovery mode, any instance of a task that has a priority higher than that of τr and a deadline greater than Dr will be delayed until recovery is complete. We refer to this recovery scheme as RS. It will be shown that if the task set satisfies [S1] then RS will satisfy both [S2] and [S3]. The recovery algorithm is presented in Figure 2. The actual execution of a task set under RMS, using recovery scheme RS for re-execution, will be called FTRMS⁴ (for fault-tolerant RMS).
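The event handlers of Figure 2 can be expressed as executable code. The following Python sketch is our own rendering under stated assumptions: tasks are (period, deadline, name) tuples, a smaller period means higher RMS priority, and the dispatching loop that actually runs tasks is left out.

import heapq

class RS:
    """Sketch of recovery scheme RS (our illustration, not the paper's code)."""
    def __init__(self):
        self.ready_q = []       # min-heap keyed by period: RMS priority order
        self.blocked_q = []     # instances delayed until recovery completes
        self.recovering = None  # the re-executing instance, if any

    def on_arrival(self, task):  # task = (period, deadline, name)
        if self.recovering is not None and task[1] > self.recovering[1]:
            self.blocked_q.append(task)         # delay: deadline exceeds Dr
        else:
            heapq.heappush(self.ready_q, task)  # as in RMS

    def on_termination(self, task, fault_detected=False):
        if fault_detected and self.recovering is None:
            self.recovering = task              # enter recovery mode
            heapq.heappush(self.ready_q, task)  # re-execute at its own priority
        elif task == self.recovering:           # recovery complete
            self.recovering = None
            while self.blocked_q:
                heapq.heappush(self.ready_q, self.blocked_q.pop())

Note that the test Dj > Dr also catches lower-priority arrivals; these would not preempt τr under RMS anyway, so only the delayed higher-priority instances matter for the proofs below.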

It is worth noting that several simple solutions, like executing τr at the highest priority or simply re-executing τr at its own priority (which does not affect higher priority tasks), do not work. In the former case, τr may cause a higher priority task τh to miss its deadline (violating [S3]), for example, if Cr > Th. In the latter case, higher priority tasks may prevent τr from being able to obtain the reserved slack within its deadline ([S2] is violated).


Figure 3. Swapping of C1 and B1 (b) and then C2 with B1 (c).

2.3. Proof of Correctness

THEOREM 1 If UB ≥ Ci/Ti, i = 1, . . . , n, then the amount of slack present within any instance of task τi is at least Ci and [S1] is satisfied.

Proof: From Eq. (1), the amount of slack available within any period Tj is UB·Tj. Since UB ≥ Ui, i = 1, . . . , n, the amount of idle time within Tj is Tj·UB ≥ Tj·Cj/Tj = Cj and hence [S1] is satisfied.

To show that [S2] is met, we use the concept of swapping. We define swapping to be the usage of slack for the execution of a task, leading to the creation of slack of equal length following the task. The slack is effectively shifted later in time, that is, it is swapped with the task's computation time,⁵ making the slack available at the end of its execution. In other words, the FTRMS schedule can be obtained from the IBRMS schedule by the process of swapping and including the recovery scheme when needed (i.e., when in recovery mode). Swapping is illustrated in Figure 3, where the black areas in the figure show the currently running part of a task, while the gray areas indicate a backup. The computation of a task τ1 gets swapped with backup time B1 (shown in Figure 3(b)) and then a part of a second task τ2 gets swapped with the previously swapped backup B1 (shown in Figure 3(c)). The second part of τ2 executes next, swapping with B1 since τ1 is not yet ready, and so on.

THEOREM 2 If [S1] is satisfied and the recovery scheme RS is followed then [S2] is satisfied.

Proof: Assume that a fault occurs during the execution of task τr which has a deadline Dr. Since [S1] is satisfied, sufficient slack was reserved for the re-execution of τr in the form of backup slots at each period boundary in the IBRMS schedule. However, during FTRMS execution, the slack is not used until failure. In this case the slack which is reserved in the IBRMS schedule gets swapped with the execution of other tasks and gets pushed forward.


To show that [S2] is satisfied it is sufficient to show that, with RS, it is not possible for this slack to be swapped forward to a point beyond the deadline of τr. Under this condition the slack reserved for τr is available within its deadline and can be used by τr for re-execution.

Let τh be any task with a priority higher thanτr which arrives before the recovery iscomplete and letDh be its deadline. Such a task may preemptτr and cause swapping. Weidentify three distinct cases depending on the arrival time and deadline ofτh. For eachcase we show that the swapping of the slack with the execution time ofτh cannot push thereserved slack beyond the deadline of the recovering task. We will use the three cases asshown in Figure 4 for the purposes of the proof.

Case I: Dh ≤ Dr

Due to the recovery scheme RS, τh will preempt τr and the reserved slack is swapped with the execution time of τh. Thus, the swapped slack lies within Dh. Since Dh ≤ Dr we conclude that any slack swapped with τh lies within Dr.

Case II: Dh > Dr and τh arrives after the fault.
The execution of τh is delayed until the completion of recovery. Therefore, τh cannot prevent the re-execution of task τr. Case II of Figure 4 depicts this scenario, where the dashed lines represent the preemption of τr by instances of tasks belonging to Case I.

Case III: Dh > Dr and τh arrives before the fault.
Here we will use proof by contradiction. Assume that the swapping of slack with the execution time of τh pushes some of the slack beyond Dr. Case III of Figure 4 depicts this scenario. This implies that in the IBRMS schedule τh executes, at least partially, after Dr, and therefore the finish time of τh in the IBRMS schedule is greater than Dr. Since τh has a priority higher than that of τr, τr cannot execute from the time τh arrives until the finish time of τh. Thus in the IBRMS schedule, if τr is to meet its deadline, then it must complete before the arrival time of τh. Since [S1] is satisfied, all tasks meet their deadlines in the IBRMS schedule and therefore τr must complete before the arrival of τh in the IBRMS schedule.

By assumption, under FTRMS execution, the fault in τr is detected after the arrival time of τh. This implies that under FTRMS execution τr does not complete before the arrival time of τh. It is not difficult to see that the finish time of any task under IBRMS is always greater than or equal to its finish time under FTRMS (because in IBRMS the tasks are delayed due to the insertion of the backup slots). Thus under IBRMS τr will not have completed before the arrival time of τh. However, this contradicts our earlier deduction that τr completes before the arrival of τh in IBRMS. Hence we must conclude that our starting assumption is invalid and that the slack cannot be pushed beyond Dr when it is swapped with τh.

Thus we have been able to show that when [S1] is satisfied, the recovery mechanism ensures that the slack reserved for τr is available for re-execution within its deadline and hence [S2] is satisfied.

As an illustration, in Figure 5 we show the same task set of Figure 3 after the first two instances of C1 and C2 have completed their executions. If τ2 fails then it can use the swapped slack space (shown as shaded boxes Bi) for re-execution. The figure illustrates that [S2] is satisfied for C2.


Figure 4. Swapping corresponding to Cases I, II and III of Theorem 3.

Figure 5. Swapping of C2 with backups complete.


To prove that the recovery mechanism satisfies [S3] we first prove two lemmas.

LEMMA 1 If [S1] is satisfied and the recovery scheme RS is followed then no task with a priority lower than τr will miss its deadline due to the re-execution of τr.

Proof: Let τr be the re-executing task and τl a lower priority task such that Tr ≤ Tl. Let us consider three scenarios: (1) Tr starts before the beginning of Tl and ends within Tl, (2) Tr is fully contained within Tl, and (3) Tr starts within Tl and ends after Tl finishes.

In cases 1 and 2, when Tr finishes before the end of Tl, there is sufficient slack within Tr to complete the re-execution of τr. If τr re-executes, it simply uses up the slack already available before the end of Tr, and recovery completes without displacing τl. Thus, τr will not cause τl to miss its deadline. Note that if τr does not re-execute, swapping occurs and τl will finish earlier than expected.

Case 3, when the end of Tr is after the end of Tl, is slightly more complex. Consider the case shown in Figure 6, where τl will finish exactly at its deadline if IBRMS is used. Task τr will need up to UB·Tr to re-execute, which is greater than the backup available between the beginning of Tr and the end of Tl. Thus, it appears that the re-execution of τr at its own priority could push τl beyond its deadline. However, it is proved below that τl will still meet its deadline.


Figure 6. Task τr finishes after τl.

Figure 7. Delayed task τh misses its deadline.


Let x, y, and z be the lengths of time as shown in Figure 6. Since it is assumed that there is a single fault, any re-executing task will not use the backup (x·UB) within Tl. Thus, τl is able to use the slack of length x·UB available between the beginnings of Tl and Tr, and finish earlier by x·UB. The re-execution of τr delays τl by (y + z)·UB, which is z·UB more than the slack scheduled before the end of Tl. Even then τl will meet its deadline because it has used up the x·UB slack available earlier (due to swapping), and x > z (since Tl > Tr ⇒ x + y > y + z).

LEMMA 2 If [S1] is satisfied and the recovery scheme RS is followed then no task with a priority higher than τr will miss its deadline due to the re-execution of τr.

Proof: Here we need to consider only those higher priority tasks whose execution is delayed during the recovery mode, that is, those higher priority tasks which arrive after the detection of failure and have a deadline later than Dr.

We will use proof by contradiction. Assume to the contrary that such a delayed higher priority task, τh, misses its deadline. Figure 7 shows such a scenario where the slack is used by τr for re-execution. Since [S1] is satisfied, in the equivalent IBRMS schedule for the same task set, τh would meet its deadline. This means that some of the slack which is used by τr in FTRMS would lie beyond Dr in the IBRMS schedule. However, this is impossible since by [S1], the slack needed by τr is reserved within its deadline in the IBRMS schedule.


Thus we must conclude that any delayed higher priority task will meet its deadline.

THEOREM 3 If [S1] is satisfied and the recovery mechanism RS is used then [S3] is satisfied.

From Lemmas 1 and 2 we can directly conclude that the re-execution of τr can cause neither higher nor lower priority tasks to miss their deadlines, and hence [S3] is satisfied.

2.4. Minimum Fault Interval

So far we have considered the failure and subsequent re-execution of any one task. In this section we prove that faults can be tolerated if they occur one at a time, no closer than Tn + Tn−1, where Tn and Tn−1 are the periods of the two lowest priority tasks.

THEOREM 4 One fault can be tolerated within Tn + Tn−1 if the slack is distributed according to (1) with UB = max_i {Ci/Ti} and the recovery scheme RS is used.

Proof: Assume to the contrary that some fault cannot be tolerated under this condition and let that fault occur within Ti, during task τi's execution. Since [S1] and [S2] are satisfied, there is enough slack in the schedule for the re-execution of τi unless another task τj has used the slack for re-execution. Therefore, we can conclude that a fault occurred within Tj and that Tj overlaps Ti. That is, two faults occurred within Tj + Ti. Since Tj + Ti ≤ Tn−1 + Tn, this contradicts the condition that faults can only arrive once every Tn + Tn−1 units of time. Hence we conclude that one fault can be tolerated within Tn + Tn−1.

A single fault being tolerated within Tn + Tn−1 is only the worst case scenario. In most cases, more than one fault can be tolerated within Tn + Tn−1, but guarantees cannot be provided. For example, if the faults occur only while each instance of τ1 is executing, then one fault can be tolerated in each T1.

2.5. Recovery from Multiple Faults

In this section, multiple faults within Tn + Tn−1 are considered. Assume that the schedule has m backups Bj, j = 1, . . . , m, with corresponding utilizations UBj. Let UBT = ∑_{j=1}^{m} UBj. The utilization of the backup has to be the total task utilization for which fault tolerance is required.

Various kinds of multiple faults can be tolerated in the following ways:

• If UBj ≥ Ui, ∀j : j = 1, . . . , m, ∀i : i = 1, . . . , n, then at least m faults can be tolerated within Tn + Tn−1. The minimum value of UBT in this case is m × max{Ci/Ti}, i = 1, . . . , n.

• If UBT = m·Ci/Ti, where 1 ≤ i ≤ n, then all tasks τj with Uj ≤ Ui can tolerate at least m faults within Tn + Tn−1.


• If only a single fault must be tolerated for all tasks τi, i = 1, . . . , n, in the system, but k additional faults must be tolerated for a set RC of m critical tasks, then the backup utilization is given by

U_{BT} = max{C_i/T_i} + k × max{C_j/T_j | τ_j ∈ RC}

Clearly, for any desired level of fault tolerance for various tasks, a different value of UBT can be derived. The recovery scheme and its correctness proof for each case are similar to those described in the case of single faults. Multiple faults will be further discussed in Section 5.

3. Schedulability Analysis of Fault Tolerant RMS

The scheme described in the previous section is a general approach to distributing slack in the schedule, and can be applied to any non-fault-tolerant scheduling scheme for preemptive, periodic tasks where the RMS assumptions hold. In this section we discuss how distributing the slack uniformly affects the well-known RMS utilization bound.

3.1. A General Fault-Tolerant RMS Bound

As the Unaive bound in (2) shows, a straightforward way of computing a schedulability bound for fault-tolerant RMS would be to decrease the bound in Liu and Layland (1973) by UB (i.e., ULL − UB). However, adding a backup with utilization UB distributed uniformly across the schedule is equivalent to scheduling the tasks τi, 1 ≤ i ≤ n, on a virtual processor with only (1 − UB) of the processor's computation power. Thus, the following bound is obtained, which is a general fault tolerance bound for RMS (UG-FT-RMS) when recovery scheme RS is used. We note that UG-FT-RMS is a tighter bound than Unaive for all values of UB.

U_{G-FT-RMS} = n(2^{1/n} − 1)(1 − U_B) = U_{LL}(1 − U_B)    (3)

The following example shows how (3) is used to determine whether a task set is schedulable or not. Consider a set of 3 tasks with utilizations U1 = 10%, U2 = 20%, and U3 = 25%. If we assume that a single fault needs to be tolerated, then UB = 25%. UG-FT-RMS gives us a bound of 58.5%, while the sum of the task utilizations is 55%. Since ∑ Ui < UG-FT-RMS, the tasks are schedulable.

Single faults and multiple faults within Tn + Tn−1 can be guaranteed to be tolerated by using the recovery scheme RS and the utilization bound described above. If several backups are provided in the system, and the total backup utilization is UBT, then a general bound for the task set can be derived by replacing UB with UBT in (3), that is, the new bound is ULL(1 − UBT).


3.2. Exact Analysis of Fault-Tolerant RMS

In Lehoczky, Sha and Ding (1989), a necessary and sufficient schedulability criterion was derived for fixed-priority periodic task sets. According to this criterion, given a set of periodic tasks τ1, τ2, . . . , τn, a task τi can be scheduled for all task phasings using the rate monotonic algorithm if and only if

\min_{0 < t ≤ T_i} { W_i(t)/t } ≤ 1    (4)

where Wi(t) is the cumulative demand on the processor made by tasks τ1, . . . , τi over [0, t] (0 is the worst-case phasing), and is computed using the following expression

W_i(t) = \sum_{j=1}^{i} C_j ⌈t/T_j⌉    (5)

Task τi is schedulable because, for some t ≤ Ti, the demand on the processor Wi(t) is no larger than the processing time t available. If the system is to be fault tolerant, with the recovery mechanism RS being used in the event of failure, then the re-execution of the tasks must also be accounted for in the workload. We assume that at most one fault occurs in Tn + Tn−1. The additional processor demand during the execution of τi due to the re-execution of higher priority tasks, Hi, can at most be equal to the maximum of the computation times of any higher priority task. τi can also be delayed due to the re-execution of lower priority tasks while the system is in the recovery mode. However, this delay, Li, cannot be any longer than Ti·UB, which is the amount of slack reserved during the period of τi. The expressions for Hi and Li can be written as:

H_i = \max_{j=1}^{i} (C_j)    (6)

L_i = \min(T_i · U_B, \max_{j=i+1}^{n} (C_j))    (7)

The expression for the fault-tolerant workload W′i(t) is:

W′_i(t) = \sum_{j=1}^{i} C_j ⌈t/T_j⌉ + \max(H_i, L_i) = W_i(t) + \max(H_i, L_i)    (8)

The fault-tolerant schedulability criterion can now be described as follows. Given a set of periodic tasks τ1, τ2, . . . , τn, a task τi can be scheduled for all task phasings using the rate monotonic algorithm and one fault can be tolerated within Tn + Tn−1 by using the recovery scheme RS if and only if

\min_{0 < t ≤ T_i} { W′_i(t)/t } ≤ 1    (9)
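Equations (4)-(9) translate directly into an exact test. The Python sketch below is ours; it assumes tasks are indexed by increasing period and checks W′i(t) at the usual scheduling points t = k·Tj ≤ Ti.

import math

# Exact fault-tolerant schedulability test, Eqs. (6)-(9).
def ft_rms_schedulable(C, T, u_b):
    n = len(C)
    for i in range(n):
        H = max(C[:i + 1])                                 # Eq. (6)
        L = min(T[i] * u_b, max(C[i + 1:], default=0.0))   # Eq. (7)
        extra = max(H, L)
        # scheduling points: multiples of higher-priority periods up to Ti
        points = sorted({k * T[j] for j in range(i + 1)
                         for k in range(1, int(T[i] // T[j]) + 1)})
        w = lambda t: sum(C[j] * math.ceil(t / T[j]) for j in range(i + 1))
        if not any(w(t) + extra <= t for t in points):     # Eqs. (8)-(9)
            return False
    return True

print(ft_rms_schedulable([1.5, 2.0], [5.0, 8.0], u_b=0.30))  # True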


4. Improved RMS Bounds for Tolerating Single Faults

We start this section with some definitions from Liu and Layland (1973) to facilitate the discussion. A processor is said to be fully utilized when an increase in the computation time of any of the tasks in the set will make the set unschedulable. The least upper bound of the total utilization is the minimum of all total utilizations over all sets of tasks that fully utilize the processor. Therefore, if a task set has a total utilization smaller than the least upper bound, then it is schedulable. Clearly, the total utilization will decrease when a fraction of the processing capacity is reserved for re-execution.

Given a set of tasks τ1, τ2, . . . , τn, and a backup utilization UB, we would like to improve on the least upper bounds of the total utilization given in Equations (3) and (2). We will achieve this goal by assuming that UB = Ui for some i, 1 ≤ i ≤ n, and following the proof strategy used in Liu and Layland (1973). Specifically, the following steps are used:

• Step 1: Determine the critical instant. The critical instant occurs when the response time of each task is maximum, and thus the total utilization is minimized.

• Step 2: Determine the relationships between Ci and Tj (i, j = 1, . . . , n) for which the upper bound of the total utilization is minimized, assuming that T1 ≤ Ti ≤ 2T1, i = 2, . . . , n.

• Step 3: For the periods Ti determined in Step 2, find the relationship between the backup utilization UB, the number of tasks n, and the least upper bound of the total utilization.

• Step 4: Generalize the bound for arbitrary task periods, relaxing the assumption in Step 2.

Once a fault-tolerant utilization bound has been found, it can be determined whether or not a given set of tasks is schedulable with fault tolerance. The derivation for two tasks, a simpler and more illustrative case, is described in Ghosh (1996).

4.1. Determining the Critical Instant

Liu and Layland (1973) have shown that, for a set of periodic preemptive tasks, the critical instant occurs when all tasks arrive simultaneously. If no faults occur, then the arguments used by Liu and Layland (1973) are also applicable to the FT scheme. However, the FT scheme also needs to consider the case where a fault occurs and a task has to be re-executed, thus adding to the response time of tasks. If a single fault occurs, the maximum response time for τi occurs when some task τj with the maximum computation time among all tasks needs to re-execute. In this case, τi gets delayed by an additional Cj due to the re-execution. This additional factor is maximum when τi arrives at the same time as τj, so that τi gets delayed by exactly Cj. Thus, the maximum response time of a task occurs when it arrives with all other tasks of higher priority. That is, the critical instant occurs when all tasks arrive at the same time.


Figure 8. Four tasks with Ti+1 = Ti + Bi+1 + Ci.

4.2. Minimizing U for Varying Computation Times

In this section, assume that T1 ≤ Ti ≤ 2T1, i = 1, . . . , n. The values of the Ci's are determined for which the total utilization is minimized, while maintaining a fully utilized processor. Let C1, C2, . . . , Cn be the computation times such that the processor is fully utilized. Using (1), the length of the backup within the first instance of T1 is B1 = T1·UB, and the length of the backup between the end of Ti and the end of Ti+1 is Bi+1 = (Ti+1 − Ti)·UB, as shown in Figure 8. In particular, C1 = T2 − T1 − B2 = (T2 − T1)(1 − UB).

First the effect of varying C1 and C2 on the total utilization is determined. If the value of C1 decreases by Δ(1 − UB), then C2 has to be increased by 2Δ(1 − UB) to ensure that the processor is still fully utilized (since the decrease of the first two instances of C1 allows more time within T2). The new utilization is

U′ = (C_1 − Δ(1 − U_B))/T_1 + (C_2 + 2Δ(1 − U_B))/T_2 + · · · + C_n/T_n

The change in utilization is positive: U′ − U = (−1/T_1 + 2/T_2) Δ(1 − U_B) > 0.

Similarly, if the value of C1 increases by Δ(1 − UB), then C2 has to be decreased by a value of Δ(1 − UB) to ensure that the processor is still fully utilized. Note that the factor by which to decrease C2 is only Δ(1 − UB), since the increase in the first instance of C1 is within T2, but the second instance of C1 increases outside T2 (thus not requiring C2 to change). This new value of utilization is

U′ = (C_1 + Δ(1 − U_B))/T_1 + (C_2 − Δ(1 − U_B))/T_2 + · · · + C_n/T_n

As before, the change in utilization is positive: U′ − U = (1/T_1 − 1/T_2) Δ(1 − U_B) > 0. Therefore, U is minimum when C1 = (T2 − T1)(1 − UB). Similarly, given that Bi+1 = (Ti+1 − Ti)·UB, i = 1, . . . , n − 1, it can be shown using the same technique that U is minimum when Ci = Ti+1 − Ti − Bi+1, which yields



C_i = (T_{i+1} − T_i)(1 − U_B), i = 1, . . . , n − 1    (10)

Since the processor is fully utilized, T_n = 2(C_1 + C_2 + · · · + C_{n−1}) + C_n + T_n·U_B. Therefore

C_n = T_n − 2(C_1 + C_2 + · · · + C_{n−1}) − T_n·U_B = (2T_1 − T_n)(1 − U_B)    (11)

An example of a schedule for four tasks satisfying (10) and (11) is shown in Figure 8.
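A quick way to sanity-check (10) and (11) is to generate such a fully utilized task set; the Python sketch below is our illustration (the periods are an arbitrary choice satisfying T1 ≤ Ti ≤ 2T1).

# Worst-case computation times of Eqs. (10)-(11) for given periods and UB.
def worst_case_computation_times(T, u_b):
    n = len(T)
    C = [(T[i + 1] - T[i]) * (1 - u_b) for i in range(n - 1)]  # Eq. (10)
    C.append((2 * T[0] - T[-1]) * (1 - u_b))                   # Eq. (11)
    return C

T = [5.0, 6.0, 7.5, 8.0]
C = worst_case_computation_times(T, u_b=0.2)
print(C)                                 # [0.8, 1.2, 0.4, 1.6]
print(sum(c / t for c, t in zip(C, T)))  # total utilization of this task set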

4.3. Minimizing U for a Given UB

The bound UG-FT-RMS (3) was derived by simply reserving enough slack for tasks to re-execute. However, by using additional information, a better bound can be achieved. This additional information takes the form of an assumption: the value of UB is equal to one of the task utilizations, that is, ∃k, 1 ≤ k ≤ n | UB = Uk.

In practice, this is easily achieved, since all the task utilizations are known. Surprisingly, this simple assumption helps us to significantly improve upon the UG-FT-RMS bound.

Assume that UB = Uk = Ck/Tk for some k ≠ 1 and k ≠ n. The cases of k = 1 and k = n are the boundary cases, and have a derivation similar to the general case (Ghosh, 1996). Substituting the value of Ck (equal to Tk·UB) in (10) yields

T_k = T_{k+1}(1 − U_B)    (12)

Using (10), (11) and (12), the total utilization is

U = \sum_{i=1}^{n} C_i/T_i
  = \sum_{i=1; i≠k}^{n−1} [ (T_{i+1} − T_i)(1 − U_B)/T_i ] + (2T_1 − T_n)(1 − U_B)/T_n + U_k
  = \sum_{i=1; i≠k}^{n−1} [ (T_{i+1}/T_i − 1)(1 − U_B) ] + (2T_1/T_n − 1)(1 − U_B) + U_B
  = U_{VAR}(1 − U_B) − (n − 1)(1 − U_B) + U_B

where U_{VAR} = \sum_{i=1; i≠k}^{n−1} [ T_{i+1}/T_i ] + 2T_1/T_n    (13)

From here onwards, only UVAR is considered, since the other terms contributing to U do not depend on the periods T1, . . . , Tn. Let h_i = T_i/T_n, i = 1, . . . , n. Using (12) and (13),

U_{VAR} = \sum_{i=1; i≠k−1; i≠k}^{n−1} h_{i+1}/h_i + h_{k+1}(1 − U_B)/h_{k−1} + 2h_1/h_n    (14)

Taking the derivatives of UVAR with respect to each h_i, i = 1, . . . , n − 1; i ≠ k (i ≠ k from (13)), and making them equal to 0 to find the minima, the following set of equations is obtained. Note that, by definition, h_n = 1.



dU/dh_1 = 0 ⇒ −h_2/h_1^2 + 2 = 0 ⇒ 2h_1^2 = h_2

dU/dh_i = 0 ⇒ 1/h_{i−1} − h_{i+1}/h_i^2 = 0 ⇒ h_i^2 = h_{i−1}h_{i+1}, i = 2, . . . , k−2, k+2, . . . , n−1

dU/dh_{k−1} = 0 ⇒ 1/h_{k−2} − h_{k+1}(1 − U_B)/h_{k−1}^2 = 0 ⇒ h_{k−1}^2 = h_{k−2}h_{k+1}(1 − U_B)

dU/dh_{k+1} = 0 ⇒ (1 − U_B)/h_{k−1} − h_{k+2}/h_{k+1}^2 = 0 ⇒ h_{k+1}^2 = h_{k−1}h_{k+2}/(1 − U_B)

Let the above set of equations be called E1. These equations (E1) lead to the following relationships.

h_i = (h_{i+1})^{i/(i+1)} / 2^{1/(i+1)}, i = 1, . . . , k−2

h_{k−1} = (h_{k+1}(1 − U_B))^{(k−1)/k} / 2^{1/k}

h_{k+1} = (h_{k+2})^{k/(k+1)} / (2(1 − U_B))^{1/(k+1)}

h_j = (h_{j+1})^{(j−1)/j} / (2(1 − U_B))^{1/j}, j = k+2, . . . , n−2

h_{n−1} = 1 / (2(1 − U_B))^{1/(n−1)}    (15)

The equations derived in E1 can be rearranged to find that

h_2/h_1 = h_3/h_2 = · · · = h_{k−1}/h_{k−2} = h_{k+1}(1 − U_B)/h_{k−1} = h_{k+2}/h_{k+1} = · · · = h_n/h_{n−1} = 2h_1/h_n    (16)

while (15) gives us

h_n/h_{n−1} = 1/h_{n−1} = [2(1 − U_B)]^{1/(n−1)}    (17)

Using (13), (14), (16) and (17),

U_{1-FT-RMS} = (n − 1)(1 − U_B)([2(1 − U_B)]^{1/(n−1)} − 1) + U_B    (18)

If a task set meets the condition ∑_{i=1}^{n} Ui ≤ U1-FT-RMS, then it is schedulable, and any task τi with Ui ≤ UB can tolerate any number of faults within the assumptions stated above.
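Eq. (18) is easy to evaluate; the Python sketch below (ours) checks the 3-task example used earlier for the general bound, now with the improved bound.

# Improved single-fault bound, Eq. (18), with UB equal to one task utilization.
def u_1_ft_rms(n, u_b):
    return (n - 1) * (1 - u_b) * ((2 * (1 - u_b)) ** (1 / (n - 1)) - 1) + u_b

utils = [0.10, 0.20, 0.25]
u_b = max(utils)                    # UB = U3 = 25%
print(u_1_ft_rms(len(utils), u_b))  # ~0.587, slightly above U_G-FT-RMS ~0.585
print(sum(utils) <= u_1_ft_rms(len(utils), u_b))  # True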


4.4. Minimizing U for Task Sets with Arbitrary Periods

In Sections 4.2 and 4.3, task sets with the restriction T1 ≤ Ti ≤ 2T1, i = 1, . . . , n, were considered. To prove that U1-FT-RMS is the least upper bound of the utilization factor for all task sets, we also need to consider task sets with Ti > 2T1 for one or more i = 1, . . . , n. This restriction can be relaxed by using the results given in Lemma 2 of Oh and Son (1994), which describes a task set transformation that preserves the utilization of the tasks during the transformation.

The modification necessary is to introduce a backup task as described in Section 2. The period of this backup task is the greatest common divisor of all task periods in the system, and the utilization of the backup task will depend on the characteristics of the system (fault characteristics and level of fault tolerance). Note that the assumption that ∃k, 1 ≤ k ≤ n | UB = Uk is still valid during the transformation.

The proofs for (18) remain the same, and thus (18) gives the least upper bound of the utilization factor for all task sets with n tasks and backup utilization UB.

5. Improved RMS Bounds for Tolerating Multiple Faults

In this section, we consider two ways in which multiple transient faults can occur. The first is when each fault occurs while different tasks are executing (Section 5.1), and the second is when a fault occurs when the primary and again when the backup of the same task are executing (Section 5.2). If the total backup utilization is UBT, then a general bound for the task set can be derived by replacing UB with UBT in (3), that is, the new bound will be ULL(1 − UBT). Again, if we use additional information, we can improve this bound.

5.1. Multiple Faults in Different Tasks

Assume m backups, each with its utilization equal to one of the task utilizations, that is, ∀j : j = 1, . . . , m, ∃i : 1 ≤ i ≤ n, UBj = Ui. Then a mapping can be produced from each backup index j to some task index i. Let i = σ(j) denote the mapping from backup index j to task index i, and S be the set of task indices i such that Ui is equal to some UBj, that is, ∃j, UBj = Ui = Uσ(j). Formally, S = { i | i = σ(j) for j = 1, . . . , m}.

For example, if UB1 = U4 and UB2 = U7, then σ(1) = 4, σ(2) = 7, and S = {4, 7}. The task indices in set S represent task utilizations which are known, and therefore constant.⁶ In that case, the following equations are obtained:

U_{Bj} = U_{σ(j)} = C_{σ(j)}/T_{σ(j)} = C_i/T_i, j = 1, . . . , m    (19)

C_i = (T_{i+1} − T_i)(1 − U_{BT}) if i < n;  C_i = (2T_1 − T_n)(1 − U_{BT}) if i = n    (20)

Using (20) the total utilization can be derived as follows. We assume that 1, n ∉ S, as we assumed 1, n ≠ k in the single fault case, for simplicity of presentation. A similar derivation of these cases generates an identical result (Ghosh, 1996).



U = \sum_{i=1}^{n} C_i/T_i
  = \sum_{i=1; i∉S}^{n−1} [ (T_{i+1} − T_i)(1 − U_{BT})/T_i ] + (2T_1 − T_n)(1 − U_{BT})/T_n + \sum_{i=1; i∈S}^{n−1} C_i/T_i
  = \sum_{i=1; i∉S}^{n−1} [ (T_{i+1}/T_i − 1)(1 − U_{BT}) ] + (2T_1/T_n − 1)(1 − U_{BT}) + U_{BT}
  = U_{VAR}(1 − U_{BT}) − (n − m)(1 − U_{BT}) + U_{BT}

where U_{VAR} = \sum_{i=1; i∉S}^{n−1} [ T_{i+1}/T_i ] + 2T_1/T_n    (21)

As before, only UVAR is considered from here onwards. The set of task indices S can be broken into smaller sets S1, S2, . . . , Sp, such that each subset has consecutive task indices. If no two indices in S are consecutive, then there are m subsets with one index in each subset. Let the number of indices in Sk be bk, k = 1, . . . , p (i.e., |Sk| = bk). Thus, b1 + b2 + · · · + bp = m.

Substituting the values of Ci (i < n) from (20) into (19), and using σ(j) = i (when i ∈ S),

T_{σ(j)} = T_{σ(j)+1} (1 − U_{BT})/(1 − U_{BT} + U_{Bj}), j = 1, . . . , m    (22)

Assume that ∀l : i, i + bl + 1 ∉ S, i + 1, i + 2, . . . , i + bl ∈ S, and bl = |Sl| such that i + 1, i + 2, . . . , i + bl ∈ Sl. Also assume that σ(xi) = i + 1, . . . , σ(xi + bl − 1) = i + bl. Then, using (22),

T_{i+1} = T_{i+b_l+1} (1 − U_{BT})^{b_l} / \prod_{j=x_i}^{x_i+b_l−1} (1 − U_{BT} + U_{Bj})    (23)

Let h_k = T_k/T_n, k = 1, . . . , n. Also, let

F_i = (1 − U_{BT})^{b_l} / \prod_{j=x_i}^{x_i+b_l−1} (1 − U_{BT} + U_{Bj})

for l = 1, . . . , p, i + 1 ∈ Sl, and σ(xi) = i + 1. Then, applying (23) to (21),

U_{VAR} = \sum_{i=1; i∉S; i+1∉S}^{n−1} [ h_{i+1}/h_i ] + \sum_{i=1; i∉S; i+1∈S}^{n−1} [ h_{i+b_l+1} F_i / h_i ] + 2h_1/h_n    (24)

Thereafter, taking the derivatives of UVAR with respect to each h_i, i = 1, . . . , n − 1, with i − 1 ∉ S, i ∉ S, i + 1 ∉ S, and making them equal to 0 to find the minima, the following set of equations is obtained. (Note that, by definition, h_n = 1.)


dU/dh_1 = 0 ⇒ −h_2/h_1^2 + 2 = 0 ⇒ 2h_1^2 = h_2

dU/dh_i = 0 ⇒ 1/h_{i−1} − h_{i+1}/h_i^2 = 0 ⇒ h_i^2 = h_{i−1}h_{i+1}, i ∉ S, i + 1 ∉ S

The next set of equations is obtained by taking the derivatives with respect to h_i and h_{i+b_l+1}, where i ∉ S, i + 1 ∈ S and i + 1 ∈ Sl.

dU/dh_i = 0 ⇒ 1/h_{i−1} − (h_{i+b_l+1}/h_i^2) F_i = 0 ⇒ h_i^2 = h_{i−1}h_{i+b_l+1}F_i

dU/dh_{i+b_l+1} = 0 ⇒ F_i/h_i − h_{i+b_l+2}/h_{i+b_l+1}^2 = 0 ⇒ h_{i+b_l+1}^2 = h_i h_{i+b_l+2}/F_i

These equations can be solved to find the utilization bound for multiple faults:

U_{M-FT-RMS} = (n − m)(1 − U_{BT}) ( [ 2 \prod_{j:σ(j)∈S} ( (1 − U_{BT})/(1 − U_{BT} + U_{Bj}) ) ]^{1/(n−m)} − 1 ) + U_{BT}    (25)

where U_{BT} = ∑_{j=1}^{m} U_{Bj}. Note that (25) reduces to (18) when m = 1.

Now consider a specific example to show how to use (25). Consider a set of 5 tasks with utilizations U1 = 4%, U2 = 5%, U3 = 10%, U4 = 17%, and U5 = 20%. Assume that one backup is needed for all the tasks, and an additional backup is needed for the critical tasks τ1 and τ2. Therefore, UB1 = 20% and UB2 = 5%. For this situation, (25) gives a bound of 56.4%, while ∑ Ui = 56%. Since ∑ Ui < UM-FT-RMS, the tasks are schedulable with the required degree of fault tolerance. On the other hand, (3) gives a bound of 55.8%, and thus, using the general bound (3), the tasks would be considered unschedulable.
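The example can be reproduced numerically; the Python sketch below (ours) evaluates Eq. (25) for the backup utilizations above.

# Multiple-fault bound, Eq. (25).
def u_m_ft_rms(n, backup_utils):
    m = len(backup_utils)
    u_bt = sum(backup_utils)
    prod = 2.0
    for u_bj in backup_utils:
        prod *= (1 - u_bt) / (1 - u_bt + u_bj)
    return (n - m) * (1 - u_bt) * (prod ** (1 / (n - m)) - 1) + u_bt

utils = [0.04, 0.05, 0.10, 0.17, 0.20]
bound = u_m_ft_rms(len(utils), [0.20, 0.05])
print(bound)                # ~0.564
print(sum(utils) <= bound)  # True: schedulable with the required fault tolerance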

Step 4 for the multiple fault model is similar to Step 4 for the single fault model described in Section 4.4.

5.2. Multiple Faults in Any Task

In this section, the case in which a fault might occur during the re-execution of a task is also considered. In certain cases, the system might need more than one backup for some critical tasks (or task sets), which must complete within their deadlines even if faults occur during their re-execution. If there are ki backups for task τi, then (19) can be written as:

U_{Bj} = k_i U_i = k_i C_i/T_i    (26)

The above derivation can be repeated for this new case, and the general form of the fault-tolerant RMS bound is:

U_{FT-RMS} = (n − m)(1 − U_{BT}) ( [ 2 \prod_{j: i=σ(j); i∈S} ( (1 − U_{BT})/(1 − U_{BT} + U_{Bj}/k_i) ) ]^{1/(n−m)} − 1 ) + U_{BT}    (27)


The utilization bounds obtained above in (25) and (27) can be used to tolerate multiple faults. Equation (27) allows great flexibility, with different tasks (or task sets) being able to tolerate a variable number of faults. If there are only a few critical tasks in the system, and their utilization is small, then multiple backups can be provided for that set of critical tasks. The bound is flexible enough to determine the least upper bound of the total utilization for any combination of backups, for example, one backup for all tasks in the system (with utilization equal to the maximum utilization of all tasks), an additional backup for the critical tasks (with utilization equal to the maximum utilization of the critical tasks), and a third backup for the most critical task in the system.
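Eq. (27) generalizes Eq. (25) by dividing each backup's utilization by its ki; the Python sketch below (ours; the pairing of each backup with its ki is our encoding of Eq. (26)) makes that explicit. With ki = 1 for every backup it reduces to Eq. (25).

# General fault-tolerant RMS bound, Eq. (27): backups is a list of
# (U_Bj, k_i) pairs, where U_Bj = k_i * U_i per Eq. (26).
def u_ft_rms(n, backups):
    m = len(backups)
    u_bt = sum(u_bj for u_bj, _ in backups)
    prod = 2.0
    for u_bj, k_i in backups:
        prod *= (1 - u_bt) / (1 - u_bt + u_bj / k_i)
    return (n - m) * (1 - u_bt) * (prod ** (1 / (n - m)) - 1) + u_bt

print(u_ft_rms(5, [(0.20, 1), (0.05, 1)]))  # ~0.564: same as Eq. (25) above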

6. Study of the Schedulability Bounds

The purpose of this section is to analyse the new fault-tolerant bounds obtained for RMS and to compare them with other proposed bounds. We also compare them with the aperiodic server bounds, since aperiodic servers can be used for the re-execution of tasks. As shown below, the benefits of overloading the backups and of the extra information used in deriving (25) and (18) are more pronounced for larger numbers of tasks and for larger backups. This is expected, since the slack introduced in our bounds takes advantage of knowing the task utilizations and the time interval between faults.

6.1. Absolute Bound Values

For a given number of tasks, the bound U1-FT-RMS depends on UB and varies between a minimum and a maximum value. The minimum can be found by taking the derivative of U1-FT-RMS (in (18)) with respect to UB.

dU_{1-FT-RMS}/dU_B = (n − 1) 2^{1/(n−1)} (n/(n−1)) (1 − U_B)^{n/(n−1)−1} (−1) − (n − 1)(−1) + 1
                  = n − n[2(1 − U_B)]^{1/(n−1)}    (28)

To find the minima, set dU_{1-FT-RMS}/dU_B = 0 to obtain

U_B = 1/2

SubstitutingUB = 0.5 in (18),U1-FT-RMS = 0.5 is obtained. Therefore, the minimumutilization of the processor is 0.5, and occurs when one of the tasks has a total utilizationof 50% (shown in the lower curve of Figure 9).
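This substitution is easy to verify from the closed form of (18) that the derivative in (28) is taken from, namely $U_{1\text{-}FT\text{-}RMS} = (n-1)\,2^{\frac{1}{n-1}}(1-U_B)^{\frac{n}{n-1}} - (n-1)(1-U_B) + U_B$ (our reading of (18), consistent with (28) and with the manipulation leading to (29) below). At $U_B = 1/2$:

$$(n-1)\,2^{\frac{1}{n-1}}\left(\tfrac{1}{2}\right)^{\frac{n}{n-1}} - \tfrac{n-1}{2} + \tfrac{1}{2} = (n-1)\,2^{\frac{1}{n-1}-\frac{n}{n-1}} - \tfrac{n-1}{2} + \tfrac{1}{2} = \tfrac{n-1}{2} - \tfrac{n-1}{2} + \tfrac{1}{2} = \tfrac{1}{2},$$

independent of $n$.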

Next, the maximum value of $U_{1\text{-}FT\text{-}RMS}$ is determined. It is known that $\frac{C_i}{T_i} \leq U_B$ for $i = 1,\ldots,n$, and thus the maximum utilization is $U = nU_B$, which occurs when $\frac{C_i}{T_i} = U_B$ for $i = 1,\ldots,n$. Using $U = nU_B$ and (18),

$$(n-1)\,2^{\frac{1}{n-1}}(1-U_B)^{\frac{n}{n-1}} + nU_B - (n-1) = nU_B$$

$$\Rightarrow\; 2^{\frac{1}{n-1}}(1-U_B)^{\frac{n}{n-1}} - 1 = 0$$

$$\Rightarrow\; U_B = 1 - \left(\frac{1}{2}\right)^{\frac{1}{n}} \qquad (29)$$



Figure 9. Maximum and minimum $U_{1\text{-}FT\text{-}RMS}$ (utilization, %) for a given number of tasks $n$.

Since $U_B$ is equal to the maximum of all task utilizations, $U_B$ cannot be less than the value in (29) if the processor is fully utilized. The maximum value of $U_{1\text{-}FT\text{-}RMS}$ for a given value of $n$ can be obtained by substituting (29) in (18) to obtain

$$\text{Maximum value of } U_{1\text{-}FT\text{-}RMS} = n\left[1 - \left(\frac{1}{2}\right)^{\frac{1}{n}}\right] \qquad (30)$$

The plot of this equation can be seen in the upper curve of Figure 9. Thus, depending on $n$ and on the task with maximum utilization, the value of $U_{1\text{-}FT\text{-}RMS}$ varies between 50% and approximately 69%. The bound is low for high values of $U_B$, and high for low values of $U_B$. This means that, if any one of the tasks in the system has high utilization (close to 50%), then the total utilization of the system is lower. On the other hand, if the utilizations of all tasks in the system are close to each other, then the total utilization of the system is higher. For example, consider a task set with 10 tasks. If the highest utilization of all tasks (equal to $U_B$) is 40%, then the least upper bound of the total utilization ($U_{1\text{-}FT\text{-}RMS}$) is 51%; if $U_B$ is 25%, then $U_{1\text{-}FT\text{-}RMS}$ is 56%; and if $U_B$ is 10%, then $U_{1\text{-}FT\text{-}RMS}$ is 65%. The minimum value of $U_B$ for 10 tasks is 6.7% (from (29)). For that value of $U_B$, $U_{1\text{-}FT\text{-}RMS}$ is 67%, which is the maximum possible for 10 tasks. As a comparison, the corresponding value of $U_{LL}$ for 10 tasks is 71%.
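These example values can be checked directly. The script below is ours (not the authors'); it assumes the closed form of (18) discussed above and reproduces the 51%, 56%, 65%, and 67% figures for $n = 10$.

```python
def u_1ft_rms(n, u_b):
    # U_1-FT-RMS from (18):
    # (n-1) * 2^(1/(n-1)) * (1-U_B)^(n/(n-1)) - (n-1)*(1-U_B) + U_B
    return ((n - 1) * 2 ** (1.0 / (n - 1)) * (1 - u_b) ** (n / (n - 1.0))
            - (n - 1) * (1 - u_b) + u_b)

n = 10
for u_b in (0.40, 0.25, 0.10):
    print(f"U_B = {u_b:.2f}  ->  U_1-FT-RMS = {u_1ft_rms(n, u_b):.3f}")
u_b_min = 1 - 0.5 ** (1.0 / n)              # Equation (29): about 0.067 for n = 10
print(f"U_B = {u_b_min:.3f}  ->  U_1-FT-RMS = {u_1ft_rms(n, u_b_min):.3f}")
print(f"U_LL = {n * (2 ** (1.0 / n) - 1):.3f}")   # Liu and Layland bound
```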


The above equations are valid when fault tolerance is required for all tasks in the system. However, fault tolerance may be provided to only some of the tasks by making $U_B$ equal to the maximum of the utilizations of all tasks that need fault tolerance. The utilizations of the remaining tasks can actually be larger than $U_B$. The bounds (18) and (27) are valid for such cases since their derivations only assume that $U_B = U_k$ for some $1 \leq k \leq n$.

Finally, it is shown that $U_{1\text{-}FT\text{-}RMS}$ is always smaller than $U_{LL}$ for the same number of tasks. Equation (30) gives the maximum value of $U_{1\text{-}FT\text{-}RMS}$ over all possible values of $U_B$. The relations below follow immediately:

$$\text{Maximum value of } U_{1\text{-}FT\text{-}RMS} = \frac{n\left[2^{1/n} - 1\right]}{2^{1/n}} = \frac{U_{LL}}{2^{1/n}} \;\Rightarrow\; U_{1\text{-}FT\text{-}RMS} \leq \frac{U_{LL}}{2^{1/n}} < U_{LL} \qquad (31)$$

Note that, from (31) and from Figure 9, the maximum value of $U_{1\text{-}FT\text{-}RMS}$ asymptotically approaches $U_{LL}$ as a function of $n$.

6.2. Comparison Among the Proposed FT Bounds

We compare the general bound (3) with the bound (18), which was derived under the assumption that $U_B = U_i$ for some $i$; we then compare the general bound with the specialized bound (25) for multiple faults.

Figure 10 shows that, for different values of $U_B$, $U_{1\text{-}FT\text{-}RMS}$ (18) is always larger (and thus enables the acceptance of more task sets) than $U_{G\text{-}FT\text{-}RMS}$ (3). However, for small values of $U_B$, the two bounds generate close results.

The graph in Figure 11 shows the relationship between $U_{M\text{-}FT\text{-}RMS}$ (25) and $U_{G\text{-}FT\text{-}RMS}$ (3) using two backups ($m = 2$), fixing the utilization of one of the two backups at 10%. The utilization of the other backup and the total number of tasks are varied. For other values of $m$ and other specific backup values, the graphs obtained are similar. As in Figure 10, the larger the backup utilizations and the number of tasks, the larger the gap between the two bounds.

6.3. Comparison of FT-RMS and Aperiodic Server Bounds

Several schemes have been developed for scheduling aperiodic tasks (called aperiodic servers) along with periodic tasks in real-time systems. For fault tolerance, the re-execution of a task $\tau_i$ can be considered as an aperiodic task which arrives when a fault is detected in $\tau_i$. The aperiodic task scheduling techniques can then be used to re-execute $\tau_i$. In this section, the FT bounds presented in this paper are shown to be larger than the aperiodic server bounds.

In the aperiodic server approaches, a certain portion of the processor utilization is reserved for executing aperiodic tasks. A server with a given period and utilization executes the aperiodic tasks, and is used to keep track of the total computation time available for them at any instant of time.



Figure 10. Comparison of the general and single fault equations: least upper bound utilization vs. $n$, for $U_B = 0.2$, $0.3$, and $0.4$.


Figure 11. Comparison of the general and multiple fault equations ($m = 2$): least upper bound utilization vs. $n$, for $U_{B1} = 0.1$, $0.2$, $0.3$ and $U_{B2} = 0.1$.



Figure 12. Comparison of the aperiodic server, FT, and LL bounds: bound vs. backup utilization, for $n = 10$.

For re-executing tasks, the approach described in this paper has several advantages over the aperiodic servers. First, the schemes dealing with aperiodic tasks do not consider the deadlines of those tasks, but try to improve their response times. For fault tolerance, the deadline of the re-executing task is important.

Second, we have compared the bounds developed for aperiodic servers with the bounds developed in this paper, and have found a significant improvement. Figure 12 shows the bounds of the different approaches for varying server utilizations. We assume that the backup utilization for re-execution of a task is equivalent to the aperiodic server utilization. The figure shows the bounds for our 1-FT-RMS scheme (Equation 18), our G-FT-RMS scheme (Equation 3), the LL scheme ($n(2^{\frac{1}{n}} - 1)$), the Priority Exchange scheme ($(n-1)\left(\left[\frac{2}{1+U_B}\right]^{\frac{1}{n-1}} - 1\right)$), and the Deferred Server scheme ($(n-1)\left(\left[\frac{2+U_B}{1+2U_B}\right]^{\frac{1}{n-1}} - 1\right)$).

Third, recovery schemes have not been described for guaranteeing the re-execution of faulty tasks in the aperiodic server approaches. It may be possible to use the reserved processor utilization to re-execute tasks, but it is not clear what the period, utilization, and priority of the aperiodic server should be to ensure that all periodic tasks and the re-executing task meet their deadlines.

Finally, the implementation of the recovery scheme described in this paper is simpler and has less overhead: there is no need for a server to keep track of the amount of processor utilization used for aperiodic tasks, and no need to maintain the time instants at which to replenish the aperiodic servers.
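For reference, the curves of Figure 12 can be regenerated from the closed-form bounds listed above. The sketch below is ours; it simply evaluates the four formulas for $n = 10$ over a range of backup (server) utilizations.

```python
def u_ll(n):                     # Liu and Layland bound
    return n * (2 ** (1.0 / n) - 1)

def u_1ft_rms(n, u_b):           # Equation (18), as in the previous sketch
    return ((n - 1) * 2 ** (1.0 / (n - 1)) * (1 - u_b) ** (n / (n - 1.0))
            - (n - 1) * (1 - u_b) + u_b)

def u_pe(n, u_b):                # Priority Exchange server bound
    return (n - 1) * ((2.0 / (1 + u_b)) ** (1.0 / (n - 1)) - 1)

def u_ds(n, u_b):                # Deferred Server bound
    return (n - 1) * (((2.0 + u_b) / (1 + 2 * u_b)) ** (1.0 / (n - 1)) - 1)

n = 10
for u_b in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"U_B={u_b:.1f}  1-FT-RMS={u_1ft_rms(n, u_b):.3f}  "
          f"PE={u_pe(n, u_b):.3f}  DS={u_ds(n, u_b):.3f}  LL={u_ll(n):.3f}")
```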


7. Simulation Studies

In this section, simulations are used to study the effects of adding slack to a system of preemptive, periodic tasks. If admission control is based on a bound that takes re-execution into consideration, then a smaller processor utilization (in the fault-free case) and a lower rate of tasks missing their deadlines will ensue.

An event-driven simulator is used to simulate the running of tasks under RMS. The events include task arrival (the beginning of a period), task begin (when the task begins to run), task end (when the task completes running), and fault occurrence. When a task $\tau_i$'s period begins, $\tau_i$ is inserted into a queue of tasks based on its priority. When the next event is the beginning of a task, the status of that task is changed from scheduled to running. Faults are injected into the system in the form of an event that raises a flag associated with the running task. When $\tau_i$ completes its execution, it is removed from the queue if no fault occurred. If a fault occurred during $\tau_i$'s execution, another copy of $\tau_i$ is inserted into the queue. To check whether $\tau_i$ has missed its deadline (due to a violation of the fault assumptions), any remaining portion of $\tau_i$ is deleted from the queue at the end of its period. A counter keeps track of the number of tasks lost (unable to meet their deadlines) as a result of faults. A task may be lost due to an excessive number of faults (i.e., more faults than assumed) during its execution, due to multiple faults during the execution of higher priority tasks, or if the value of $U_B$ chosen by the system designer is not sufficient to re-execute the task.
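The sketch below captures the core of this loop. It is our reconstruction, not the authors' simulator: it uses a unit time quantum instead of a true event queue, restarts a faulty instance from the beginning, and draws inter-fault times uniformly with the given mean (a uniform fault distribution is mentioned in Section 7.2).

```python
import random

def simulate(tasks, mttf, horizon):
    """Preemptive RMS with fault injection and re-execution.

    tasks   -- list of (C_i, T_i) pairs; periods must be integers here
    mttf    -- mean time to fault
    horizon -- number of unit-length time steps to simulate
    Returns the number of task instances lost (deadline misses).
    """
    remaining = [0.0] * len(tasks)            # work left in current instance
    lost = 0
    next_fault = random.uniform(0, 2 * mttf)  # uniform inter-fault times, mean mttf
    for t in range(horizon):
        for i, (c, p) in enumerate(tasks):
            if t % p == 0:                    # a new period begins
                if remaining[i] > 0:          # previous instance missed its deadline;
                    lost += 1                 # delete its remaining portion
                remaining[i] = c
        ready = [i for i in range(len(tasks)) if remaining[i] > 0]
        if ready:
            i = min(ready, key=lambda j: tasks[j][1])  # RMS: shortest period first
            remaining[i] -= 1.0                        # run for one time unit
            if t >= next_fault:               # fault hits the running task:
                remaining[i] = tasks[i][0]    # re-insert a fresh copy (re-execute)
                next_fault = t + random.uniform(0, 2 * mttf)
        elif t >= next_fault:                 # fault during idle time is harmless
            next_fault = t + random.uniform(0, 2 * mttf)
    return lost
```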

One of the main goals of the simulation is to measure the effect of adding slack on the schedulability and fault tolerance capabilities of a task set scheduled using RMS. The FT scheme introduced above is compared with the Liu and Layland scheme, which provides no fault tolerance (referred to as the LL scheme from here onwards). For the FT scheme, the $U_{1\text{-}FT\text{-}RMS}$ bound is used to schedule tasks, and for the LL scheme, the $U_{LL}$ bound is used. A comparison of these two schemes shows how much the FT scheme loses, in terms of scheduling fewer tasks, due to the additional fault tolerance capability. Also, the gain of the FT scheme in terms of fewer lost tasks can be measured.

The input variables of the simulation are:

• $U_B$: the backup utilization of the system.

• MTTF: the mean time to fault of the system, that is, the average time after which the next fault is expected to occur.

• $U_{av}$: the average task utilization.

The effects of these input variables are studied on the following metrics:

• Schedulability: the number of tasks that can be scheduled using the FT scheme and the LL scheme.

• Utilization: the total utilization of the task set scheduled using the two schemes.

• Lost tasks: the number of task instances lost as a result of faults.

For each result, the simulation was run 500 times and the individual values were averaged. Each simulation was run for 100,000 time units. Note that schedulability is measured in terms of periodic tasks, while lost tasks are measured in terms of task instances.


7.1. Task Generation

A set of periodic tasks is generated for every run of the simulation. The utilization $U_i$ of each task $\tau_i$ is uniformly distributed between 0 and $2U_{av}$. The task periods, $T_i$, are uniformly distributed between 10 and 90, making the average period 50. The actual values of the periods are not as important as their relative values, because the results will be nearly identical if all periods are multiplied by a constant factor while maintaining the task utilizations. The computation time $C_i$ of each task is generated using the formula $C_i = U_i T_i$.

The total number of tasks in the set is limited by the sum of their utilizations and the bound used to schedule the task set. Each task set is constructed by generating individual tasks and adding them to the task set until the utilization of the tasks in the set violates the appropriate bound ($U_{1\text{-}FT\text{-}RMS}$ or $U_{LL}$). Note that the backup utilization $U_B$ is predetermined and is not equal to one of the task utilizations. Therefore, the $U_{1\text{-}FT\text{-}RMS}$ bound used here is only an approximation, since the actual task sets are randomly generated.
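This generation procedure translates directly into code. The sketch below is ours; integer periods are drawn so that the output can feed the simulator sketch above.

```python
import random

def make_task_set(u_av, bound):
    """Generate (C_i, T_i) pairs until the utilization bound would be violated."""
    tasks, total_u = [], 0.0
    while True:
        u = random.uniform(0.0, 2.0 * u_av)   # U_i uniform in (0, 2*U_av)
        t = random.randint(10, 90)            # T_i uniform in [10, 90], mean 50
        if total_u + u > bound:               # adding this task breaks the bound
            return tasks
        tasks.append((u * t, t))              # C_i = U_i * T_i
        total_u += u
```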

From here onwards, the task sets accepted using the FT scheme are called the FT task sets, and those accepted using the LL scheme are called the LL task sets.

7.2. Comparison of the FT Scheme and the LL Scheme

First, the effect of $U_B$ on schedulability is studied. As $U_B$ increases, the schedulability of the FT scheme decreases. However, the schedulability of the LL scheme remains unchanged, since it does not take $U_B$ into consideration. Figure 13 shows the schedulability of the task sets as a function of $U_B$ using the two bounds for different values of $U_{av}$. Note that, as conjectured, the decrease in schedulability is less sharp for larger average utilizations. Moreover, as expected, the LL scheme schedules more tasks than the FT scheme for all values of $U_B$. In all graphs shown in this section, $U_B$ starts from 0.05, since $U_B = 0$ allows no fault tolerance. We have also noted that the processor utilization values for both schemes correspond to the average number of scheduled tasks, since the utilization values are approximately equal to the average task utilization times the average number of scheduled tasks.

In the rest of this section, only tasks with a $U_{av}$ of 10% (range from 0 to 20%) are studied. This range seems realistic considering the descriptions of existing systems in studies such as Carpenter et al. (1994) and Locke, Vogel and Mesler (1991).

Since our experiments may violate the fault assumptions (recall that the fault distribution is uniform, thus allowing for faults to be close to each other), in the following graphs we compare the FT scheme presented in this paper, the LL scheme, and a simpler recovery scheme that re-executes a faulty task at its own priority (the OWN-PR scheme). As explained in Section 2.2, the OWN-PR scheme may cause tasks to miss deadlines even if the fault assumption holds. Below, we investigate the penalty a system would suffer if such a simple recovery scheme is used.

In Figure 14, the percentage of task instances lost due to faults is shown. For a higher fault rate (smaller Mean Time To Fault, MTTF), the LL task sets lose a larger number of incoming task instances. Even for lower $U_B$ values, the FT task sets lose a smaller number of task instances.



Figure 13. Average number of tasks scheduled using the LL scheme and the FT scheme as a function of $U_B$, for $U_{av} = 0.05$, $0.1$, and $0.2$.

In the FT scheme, the number of lost tasks decreases monotonically as $U_B$ increases. As the fault rate decreases (larger MTTF), the number of tasks lost by both schemes decreases. As expected, the LL scheme and the OWN-PR scheme always lose more tasks than the FT scheme for all values of MTTF and $U_B$.

To study the increase in lost tasks from using the LL scheme instead of the FT scheme or the OWN-PR scheme, the difference in lost task instances is studied, normalized with respect to the number of task instances lost by the LL scheme, that is, $\frac{LL_{lost} - FT_{lost}}{LL_{lost}}$, where $FT_{lost}$ is the number of tasks lost by either the FT scheme or the OWN-PR scheme. In Figure 15, this value is plotted for varying $U_B$. This graph gives the normalized increase in the number of lost tasks relative to the number lost if backups are not used. For lower values of MTTF, the normalized increase is low (even though the absolute difference is higher, as seen in Figure 14). For higher values of MTTF, the normalized increase is larger.

Clearly, the value of $U_B$ should be chosen with two factors in mind: the lost task rate and the schedulability of the processor. Therefore, the value of $U_B$ depends on the relative cost of losing tasks compared to that of not being able to admit a task due to the schedulability bounds.

8. Concluding Remarks

In this paper, we have shown an approach to provide fault tolerance for preemptive tasks while dispatching tasks according to the rate monotonic scheduling approach. New utilization bounds were developed, which can be used to schedule sets of preemptive tasks while guaranteeing the re-execution of tasks when faults are detected. This is provided by adding slack to a schedule of periodic tasks such that every instance of each task is guaranteed to tolerate single or multiple faults.



Figure 14. Lost task instances using the LL scheme, the OWN-PR scheme, and the FT scheme, for MTTF = 100 and 500.

The number of faults that can be tolerated depends on the amount of slack added. A simple recovery scheme was described, which can also be used to tolerate software faults by re-executing a different version of the task if the first version fails to meet its acceptance test. The scheduling scheme and the schedulability bounds remain the same, regardless of which software version is used.

We believe that the fault-tolerant RMS scheme is particularly useful in applications such as avionics (Lachenmaier and Stretch, 1994), where RMS is used as a scheduling technique and retry is used as one of the fault tolerance mechanisms. The scheme is also useful in satellites, which have a high rate of transient errors (Campbell, McDonald and Ray, 1992) and where RMS is used for navigation payload software (Doyle and Elzey, 1994).

In addition to tolerating transient faults, the scheme presented in this paper can be used in two special cases: (1) to tolerate timing faults, which occur when a task exceeds its assigned maximum execution time, and (2) for the implementation of the imprecise computation paradigm (Liu et al., 1989). In such cases, the reserved slack can be used to complete the task's execution, and thus tolerate timing faults up to a certain limit (equal to the backup length, or at least twice the execution time).



Figure 15. Normalized increase in lost tasks, $(LL_{lost} - FT_{lost})/LL_{lost}$, by using the LL scheme instead of the FT scheme (or the OWN-PR scheme), for MTTF = 50 to 1000.

Similarly, if the worst-case execution time of most real-time tasks in a system is much longer than the average execution time, then the average case can be used as the computation time when scheduling each task. The maximum difference between the worst-case and average computation times, or the maximum difference between the optional and mandatory parts of imprecise computations (over all tasks), can be assigned as a backup. If any of the tasks exceeds its assigned computation time, then the backup can be used to execute the task to completion. If the probability of such timing faults is low, then the number of backups can be small, since not many tasks will exceed their computation time. If this probability is large, then several backups may be needed. The number of backups is also a function of the criticality of the tasks.

Acknowledgments

Supported in part by NSF grant CCR-9308886; also part of the FORTS project, funded by DARPA under contract DABT63-96-C-0044. A preliminary version of this paper was presented at the Dependable Computing for Critical Applications Conference (DCCA 97).


Notes

1. Other studies have dealt with real-time fault-tolerant scheduling using a timeline and a primary-backup approach (Ghosh, Melhem and Mossé, 1995; Krishna and Shin, 1986; Liestman and Campbell, 1988; Mehra et al., 1995). Since these works are not related to RMS, we do not discuss them here. We also do not discuss Oh and Son (1994), which deals with fault-tolerant scheduling using RMS on a multiprocessor, since it only presents a discussion of permanent faults.

2. A utilization bound $U$ allows a simple test of the form $\sum U_i \leq U$ to determine a task set's schedulability. For example, in RMS, $U = n(2^{1/n} - 1)$.

3. Overloading is defined in Ghosh, Mossé and Melhem (1997) as the sharing of a time interval among several tasks.

4. Note that the difference between FTRMS and IBRMS is that the latter is an abstraction of the execution order of tasks including backups, while the former is the actual execution order of tasks in case a fault occurs.

5. Note that swapping, like the backups or slack, is just an abstraction, and is used simply to demonstrate that the scheme works. Swapping will automatically take place in the fault-free case in RMS whenever the slack is not used for re-executing a task.

6. Recall that in Liu and Layland (1973), all task utilizations were variable.

References

Audsley, N. C., Burns, A., Richardson, M. F., and Tindell, K. 1993. Applying new scheduling theory to static priority pre-emptive scheduling. Software Engineering Journal 8(5): 284–292.

Sha, L., Sprunt, B., and Lehoczky, J. 1989. Aperiodic task scheduling for hard-real-time systems. Journal of Real-Time Systems: 27–60.

Baldwin, J. T. 1995. Predicting and estimating real-time performance. Embedded Systems Programming 8(2).

Burns, A., Davis, R., and Punnekkat, S. 1996. Feasibility analysis of fault-tolerant real-time task sets. In 8th Euromicro Workshop on Real-Time Systems.

Campbell, A., McDonald, P., and Ray, K. 1992. Single event upset rates in space. IEEE Trans. on Nuclear Science 39(6): 1828–1835.

Carpenter, T., Driscoll, K., Hoyme, K., and Carciofini, J. 1994. ARINC 659 scheduling: problem definition. In Real-Time Systems Symposium, IEEE, pp. 165–169.

Castillo, X., McConnel, S. R., and Siewiorek, D. P. 1982. Derivation and calibration of a transient error reliability model. IEEE Trans. on Computers C-31(7): 658–671.

Doyle, L., and Elzey, J. 1994. Successful use of rate monotonic theory on a formidable real time system. In 11th IEEE Workshop on Real-Time Operating Systems and Software, pp. 74–78.

Gaisler, J. 1994. Concurrent error-detection and modular fault-tolerance in a 32-bit processing core for embedded space flight applications. In Symp. on Fault Tolerant Computing (FTCS-24), IEEE, pp. 128–130.

Ghosh, S. 1996. Guaranteeing fault tolerance through scheduling in real-time systems. Ph.D. Thesis, University of Pittsburgh, ftp://cs.pitt.edu/realtime/ghosh-diss.ps.gz.

Ghosh, S., Melhem, R., and Mossé, D. 1995. Enhancing real-time schedules to tolerate transient faults. In Real-Time Systems Symposium.

Ghosh, S., Melhem, R., and Mossé, D. 1997. Fault tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems. IEEE Transactions on Parallel and Distributed Systems 8(3): 272–284.

Hecht, M. S., Hammer, J. B., Locke, C. D., Dehn, J. D., and Bohlmann, R. 1994. Rate monotonic analysis of a large, distributed system. In IEEE Workshop on Real-Time Applications, IEEE, pp. 4–7.

Iyer, R. K., Rossetti, D. J., and Hsueh, M. C. 1986. Measurement and modeling of computer reliability as affected by system activity. ACM Trans. on Computer Systems 4(3): 214–237.

Johnson, B. W. 1989. Design and Analysis of Fault Tolerant Digital Systems. Addison Wesley Pub. Co., Inc.

Kane, J. R., and Yau, S. S. 1975. Concurrent software fault detection. IEEE Transactions on Software Engineering SE-1(1): 87–99.

Kopetz, H. 1995. Automotive electronics–present state and future prospects. In FTCS 25.

Kopetz, H., Kantz, H., Grunsteidl, G., Puschner, P., and Reisinger, J. 1990. Tolerating transient faults in MARS. In Symp. on Fault Tolerant Computing (FTCS-20), IEEE, pp. 466–473.


Krishna, C. M., and Shin, K. 1986. On scheduling tasks with a quick recovery from failure. IEEE Trans. on Computers 35(5): 448–455.

Krishna, C. M., and Singh, A. D. 1993. Reliability of checkpointed real-time systems using time redundancy. IEEE Trans. on Reliability 42(3): 427–435.

Lachenmaier, R., and Stretch, T. 1994. The IEEE scalable coherent interface: An approach for a unified avionics network. In Advanced Packaging Concepts for Digital Avionics.

Lehoczky, J. P., Sha, L., and Ding, Y. 1989. The rate monotonic scheduling algorithm: Exact characterization and average case behavior. In Real-Time Systems Symposium, pp. 166–171.

Lehoczky, J. P., Sha, L., and Strosnider, J. K. 1987. Enhanced aperiodic responsiveness in hard-real-time environments. In Real-Time Systems Symposium, pp. 261–270.

Liestman, A. L., and Campbell, R. H. 1988. A fault-tolerant scheduling problem. IEEE Trans. on Software Engineering SE-12(11): 1089–1095.

Liu, C. L., and Layland, J. 1973. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM 20(1): 46–61.

Liu, J. W., Lin, K., Liu, C. L., and Gear, C. W. 1989. Research on imprecise computations in project quartZ. In Proc. of Workshop on Operating Systems for Mission Critical Computing.

Locke, C. D., Vogel, D. R., and Mesler, T. J. 1991. Building a predictable avionics platform in Ada: A case study. In Real-Time Systems Symposium, IEEE, pp. 181–189.

Mahmood, A., and McCluskey, E. J. 1988. Concurrent error detection using watchdog processors—A survey. IEEE Transactions on Computers 37(2): 160–174.

Mehra, A., Rexford, J., Ang, H. S., and Jahanian, F. 1995. Design and evaluation of a window-consistent replication service. In Proc. Real-Time Technology and Applications Symp.

Miremadi, G., and Torin, J. 1995. Evaluating processor-behavior and three error-detection mechanisms using physical fault-injection. IEEE Transactions on Reliability 44(3): 441–453.

Mossé, D., Melhem, R., and Ghosh, S. 1994. Analysis of a fault-tolerant multiprocessor scheduling algorithm. In 24th Int'l Symposium on Fault-Tolerant Computing, Austin, TX, IEEE.

Oh, S. K., and MacEwen, G. 1992. Toward fault-tolerant adaptive real-time distributed systems. External Technical Report 92-325, Department of Computing and Information Science, Queen's University, Kingston, Ontario, Canada.

Oh, Y. 1994. The design and analysis of scheduling algorithms for real-time and fault-tolerant computer systems. Ph.D. Thesis, University of Virginia.

Oh, Y., and Son, S. H. 1994. Enhancing fault-tolerance in rate-monotonic scheduling. The Journal of Real-Time Systems 7(3): 315–329.

Pandya, M., and Malek, M. 1994. Minimum achievable utilization for fault-tolerant processing of periodic tasks. Technical Report TR 94-07, Univ. of Texas at Austin, Dept. of Computer Science.

Pradhan, D. K. 1986. Fault Tolerant Computing: Theory and Techniques. Prentice-Hall, NJ.

Ramos-Thuel, S. 1993. Enhancing fault tolerance of real-time systems through time redundancy. Ph.D. Thesis, Carnegie Mellon University.

Ramos-Thuel, S., and Strosnider, J. K. 1995. Scheduling fault recovery operations for time-critical applications. In 4th IFIP Conference on Dependable Computing for Critical Applications.

Randell, B. 1975. System structure for software fault tolerance. IEEE Trans. on Software Engineering SE-1(2): 220–232.

Schuette, M. A., and Shen, J. P. 1987. Processor control flow monitoring using signatured instruction streams. IEEE Transactions on Computers C-36(3): 264–275.

Siewiorek, D. P., Kini, V., Mashburn, H., McConnel, S., and Tsao, M. 1978. A case study of C.mmp, Cm*, and C.vmp: Part 1—Experiences with fault tolerance in multiprocessor systems. Proceedings of the IEEE 66(10): 1178–1199.

Tindell, K. W. 1994. Fixed Priority Scheduling of Hard Real-Time Systems. Ph.D. Thesis, Univ. of York, UK.

Yau, S. S., and Chen, F. C. 1980. An approach to concurrent control flow checking. IEEE Transactions on Software Engineering SE-6(2): 126–137.