
Evaluating Impact of Human Errors on the Availability of Data Storage Systems

Mostafa Kishani, Reza Eftekhari, and Hossein Asadi
Data Storage, Networks, & Processing (DSN) Lab, Department of Computer Engineering, Sharif University of Technology

Abstract—In this paper, we investigate the effect of incorrect disk replacement service on the availability of data storage systems. To this end, we first conduct Monte Carlo simulations to evaluate the availability of the disk subsystem by considering disk failures and incorrect disk replacement service. We also propose a Markov model that corroborates the Monte Carlo simulation results. We further extend the proposed model to consider the effect of an automatic disk fail-over policy. The results obtained by the proposed model show that overlooking the impact of incorrect disk replacement can result in up to three orders of magnitude underestimation of unavailability. Moreover, this study suggests that once the effect of human errors is considered, the conventional beliefs about the dependability of different RAID mechanisms should be revised. The results show that in the presence of human errors, RAID1 can result in lower availability than RAID5.

I. INTRODUCTION

Human errors have a significant impact on the availability of information systems [1], [2], [3]; some field studies have reported that 19% of system failures are caused by human errors [4], [3]. In large data-centers with Exa-Byte (EB) storage capacity (employing more than one million disk drives), one should expect at least one disk failure per hour. Despite mechanisms such as automatic fail-over in modern data-centers, in many cases the role of human agents is inevitable. Meanwhile, the probability of human error, even when using precautionary mechanisms such as checklists and employing highly educated and highly trained human resources, is between 0.1 and 0.001 [5], [6], [7], [8]. Such statistics imply that an exascale data-center will face multiple human errors a day.

Disk drives are among the most vulnerable components in a Data Storage System (DSS). Disk failures and Latent Sector Errors (LSEs) [9] are among the main sources of data loss in a disk subsystem. Several studies have investigated the effect of these two incidents on single-disk and disk-array reliability [9], [10], [11], [12]. In particular, the failure root-cause breakdown in previous studies [4], [3] shows that human error is of great importance.

In this paper, we propose an availability model for the disk subsystem of a backed-up data storage system1 by considering the effect of disk failures and human errors. While incorrect repair service can have many different roots and happen under many different conditions, in this work we consider only the incorrect disk replacement service and call it

1A storage system that keeps an updated backup of data, for example on a tape. In such a system, we assume that data loss can be recovered using the backup and hence has only an unavailability consequence.

Wrong Disk Replacement. In our analysis, disk subsystems both with and without automatic disk fail-over are considered. The proposed analytical technique is based on Markov models and hence requires the assumption of exponential distributions for time-to-failure and time-to-restore. Furthermore, to cope with other probability distribution functions such as Weibull, which describes disk failure behavior in a more realistic manner [12], we have developed a model based on Monte-Carlo (MC) simulations. This model has also been used as a reference to validate the proposed Markov model when using exponential distributions.

By incorporating the impact of human errors on the availability of the disk subsystem, several important observations are obtained. First, it is shown that overlooking the impact of incorrect repair service results in a considerable underestimation (up to 263X) of the system downtime. Second, it is observed that in the presence of human errors, conventional assumptions about the availability ranking of different Redundant Array of Independent Disks (RAID) configurations can be contradicted. Third, it is demonstrated that automatic disk fail-over can significantly improve the overall system availability when on-line rebuild is provided by using spare disks.

The remainder of this paper is organized as follows. Section II presents background on human errors. Section III elaborates the Monte-Carlo simulation-based model. Section IV presents the proposed Markov models considering the impact of human errors. Section V provides simulation results and the corresponding findings. Lastly, Section VI concludes the paper.

II. BACKGROUND

A. Human Error in Safety-Critical Applications

To better understand and quantify human errors in a non-benign system, Human Reliability Assessment (HRA) [13] techniques have been developed, whose major focus is the quantification of the Human Error Probability (hep), simply defined as the fraction of observed error cases over the opportunities for human error [6]. By collecting hep values reported by NASA, EUROCONTROL, and NUREG, we found that human error usually has a probability in the range of 0.001 to 0.1, depending on the application and situation. This probability mainly varies from 0.001 up to 0.01 in enterprise and safety-critical applications [6], [7], [8], [5].
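As a concrete illustration of this definition (the counts below are hypothetical and of our own choosing, not taken from the cited studies):

```python
# Hypothetical counts, used only to illustrate the hep definition:
# hep = observed error cases / opportunities for human error.
observed_errors = 3
error_opportunities = 3000   # e.g., logged disk replacement operations
hep = observed_errors / error_opportunities
print(hep)  # 0.001, the low end of the 0.001-0.1 range reported in the literature
```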

B. Human Errors in Data Storage Systems

While human errors in data storage systems can happen in very different situations, in this work we focus on one of

arXiv:1806.01360v1 [cs.PF] 26 May 2018



Fig. 1: MC Simulation to Calculate Availability of a RAID5 3+1 Array in Presence of Human Errors (rebuild time = 10h)

the most prevalent examples, the wrong disk replacement. In a RAID array, e.g., RAID5, with no spare disk, the disk fail-over process can start only after replacing the failed disk with a brand-new disk. Consider the case in which the operator swaps the brand-new disk with one of the operating disks rather than with the failed one. In this case, two disks are inaccessible (the failed disk and the wrongly removed operating disk), making the entire data unavailable. However, detecting the human error and undoing the wrong disk replacement makes the array available again at no data-loss cost.

In the next section, we describe a simulation-based reference model to evaluate the availability of data storage systems considering disk failures and the effect of human errors happening in the disk replacement process.

III. AVAILABILITY OF A BACKED-UP DISK SUBSYSTEM BY MONTE-CARLO SIMULATION

In the MC model, failure and repair events are generated by assuming the desired distributions, such as Weibull and exponential. Fig. 1 illustrates an example of the MC simulation for a RAID5 (3+1) array. If two disk failures happen in the same array and the second failure arrives before the recovery of the first one completes, a data loss event happens (as shown in Fig. 1). As we assume that the data storage system is backed up, we treat the data loss event as data unavailability, whose duration is the data loss recovery (tape recovery in our example) time. In the case of a single disk failure, the failed disk is replaced by a human agent. However, the occurrence of a human error in the disk replacement process makes another working disk unavailable, resulting in the unavailability of the entire array (in the case of RAID5), whose duration varies with the human error recovery time. The overall availability of the disk subsystem is calculated by dividing the total disk subsystem uptime by the overall simulation time. The error of MC simulations is inversely proportional to the square root of the number of iterations and proportional to the Student's t coefficient for a target confidence level [14].
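The simulation procedure described above can be sketched as follows. This is a minimal illustration of the technique rather than the authors' simulator: it assumes exponential failure and repair times, ignores the post-error disk crash rate (λCrash), and uses hypothetical parameter values; all function and variable names are ours.

```python
import random
from statistics import mean

def raid5_downtime(n=4, lam=1e-5, mu_df=0.1, mu_ddf=0.03, mu_he=1.0,
                   hep=0.01, mission=1e6, rng=random):
    """One MC run of an n-disk RAID5 array; returns total downtime (hours)."""
    t = down = 0.0
    while True:
        t += rng.expovariate(n * lam)            # time of the next first-disk failure
        if t >= mission:
            break
        rebuild = rng.expovariate(mu_df)         # rebuild duration
        second = rng.expovariate((n - 1) * lam)  # time until a second disk failure
        if second < rebuild:                     # double disk failure -> data loss,
            t += second                          # recovered from the tape backup
            outage = rng.expovariate(mu_ddf)
        elif rng.random() < hep:                 # wrong disk replaced -> array is
            t += rebuild                         # unavailable until the error is undone
            outage = rng.expovariate(mu_he)
        else:                                    # successful rebuild: array stayed up
            t += rebuild
            continue
        down += min(outage, max(mission - t, 0.0))
        t += outage
    return down

def availability(runs=200, **kw):
    """Availability averaged over many MC runs (uptime / simulated time)."""
    rng = random.Random(1)
    mission = kw.get("mission", 1e6)
    return 1.0 - mean(raid5_downtime(rng=rng, **kw) for _ in range(runs)) / mission

print(availability(hep=0.01))   # close to 1; downtime is dominated by human errors
```

Note that, as in the reference model, the exposed (rebuilding) period itself counts as uptime; only the data loss and data unavailability periods contribute to downtime.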

IV. AVAILABILITY OF A BACKED-UP DISK SUBSYSTEM USING MARKOV MODEL

In this section, we propose a Markov model for the availability of a backed-up disk subsystem that corroborates the Monte Carlo reference model, assuming an exponential distribution for both failure and repair rates. Finally, we extend the Markov model to a disk subsystem with automatic fail-over.

Fig. 2: Markov Model for RAID5 Availability. States: OP (operational), EXP (exposed, one disk failed), DU (data unavailable after a human error), DL (data loss after a double disk failure). Legend: hep: human error probability; λ: disk failure rate; λCrash: rate of disk crash after human error; µhe: human error recovery rate; µDF: disk failure recovery rate; µDDF: double disk failure recovery rate; n: number of disks.

A. Markov Model of RAID5 in Presence of Human Errors

Fig. 2 shows the proposed Markov model for the availability of a backed-up disk subsystem, considering the effect of disk failures and human errors. In this model, the disk failure rate, disk repair rate, double disk failure recovery rate from the primary backup, and human error probability are denoted by λ, µDF, µDDF, and hep, respectively. Upon the occurrence of the first disk failure, the system moves from the operational state (OP) to the exposed state (EXP). While the system is in the exposed state, a second disk failure leads to a Data Loss (DL) event, while a human error during disk replacement leads to a Data Unavailability (DU) event. If the human agent successfully replaces the failed disk, the array returns to the OP state.

When the array is in the DU state, the incidence of a further human error during the fail-over process keeps the array in the DU state. Otherwise, if no human error happens in the fail-over process, the array transits to the OP state. In the DU state, if the wrongly replaced disk crashes, the array switches to DL. Finally, when the array is in the DL state, it can be recovered at the rate µDDF.
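Under the exponential assumption, the steady-state availability of this four-state chain can be computed numerically. The sketch below is our own illustration, not the authors' code: state names and transitions follow the description of Fig. 2, the parameter values are the typical ones used later in Section V, and the DU self-loop (hep·µhe) is captured implicitly by the reduced exit rate (1 − hep)·µhe.

```python
import numpy as np

def raid5_availability(n=4, lam=1e-6, mu_df=0.1, mu_ddf=0.03,
                       mu_he=1.0, lam_crash=0.01, hep=0.001):
    """Steady-state availability of the CTMC of Fig. 2 (states OP, EXP, DU, DL)."""
    OP, EXP, DU, DL = range(4)
    Q = np.zeros((4, 4))                 # generator matrix: Q[i, j] = rate i -> j
    Q[OP,  EXP] = n * lam                # first disk failure
    Q[EXP, OP]  = (1 - hep) * mu_df      # successful replacement and rebuild
    Q[EXP, DU]  = hep * mu_df            # wrong disk replaced (human error)
    Q[EXP, DL]  = (n - 1) * lam          # second failure -> double disk failure
    Q[DU,  OP]  = (1 - hep) * mu_he      # human error undone without a new error
    Q[DU,  DL]  = lam_crash              # wrongly removed disk crashes
    Q[DL,  OP]  = mu_ddf                 # restore from the backup
    np.fill_diagonal(Q, -Q.sum(axis=1))
    # solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
    A = Q.T.copy()
    A[-1, :] = 1.0
    b = np.zeros(4)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return pi[OP] + pi[EXP]              # DU and DL are the down states

print(raid5_availability())
```

The exposed state counts as up because a RAID5 array still serves data while rebuilding; only DU and DL contribute to unavailability.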

B. Markov Model of RAID5 With Automatic Fail-over

Here, we study the effect of automatic disk fail-over when the on-line rebuild process is performed using hot-spare disks. In the conventional disk replacement policy, a failed disk may be replaced by a new disk before the completion of the on-line rebuild process. In the automatic fail-over policy, as opposed to the conventional disk replacement policy, the replacement process starts only after the completion of the on-line rebuild process. In the automatic fail-over policy, it is assumed that a hot-spare disk is available within the array while the system is in the operational state.

Fig. 3 shows the Markov model of a RAID5 array employing the automatic fail-over policy. The system is in the OP state when all disks work properly and a spare disk is present. In the case of a disk failure, the array switches to EXP1. In the EXP1 state, the system goes either to the DL state upon another disk failure or to the OPns state if the failed disk is rebuilt onto the available spare disk. In the OPns state, all disks of the array work properly but no spare is present. The automatic fail-over paradigm forbids the operator from replacing the affected disk before the completion of the on-line rebuild process. Hence, disk replacement can be performed only in states other than EXP1, and there is no possibility of human error in the EXP1 state. When the system is in the OPns state, a disk failure switches the system to EXPns1. If the failed disk is successfully replaced by the new disk, the array returns to the OP state. Otherwise, if a human error happens in the disk replacement process, the array switches to EXPns2.

Fig. 3: Markov Model of RAID5 Availability with Automatic Fail-over. Legend: n: number of disks; hep: human error probability; λ: disk failure rate; λCrash: rate of disk crash after human error; µDF: disk failure recovery rate; µDDF: double disk failure recovery rate; µhe: human error recovery rate; µch: rate of changing the failed disk.

In the EXPns1 state, the array has a failed disk and no spare. In this state, a successful disk fail-over process changes the array state to OPns. Upon the successful replacement of the failed disk in the EXPns1 state, the array switches to EXP1. However, if a human error happens in the disk fail-over process or in the replacement of the failed disk, the array switches to DUns1. Upon a disk failure when the array is in the EXPns1 state, the array switches to DLns.

In the EXPns2 state, one of the operational disks has been wrongly replaced by the new disk due to a human error. To remove this error, the wrongly removed disk should be placed back and, in return, the failed disk should be removed. If this process succeeds, the array goes back to OP. Otherwise, if another human error happens while recovering from the first one, the array switches to DUns2. In the EXPns2 state, if the wrongly removed disk crashes, the array switches to EXPns1. A disk failure when the array is in the EXPns2 state switches the array to DUns1.

In the DL state, the user data is totally lost due to a Double Disk Failure (DDF) and a hot-spare disk is available. In this case, the DDF can be recovered at the rate µDDF. Similarly, in the DLns state the user data is totally lost due to a DDF but no spare is available. Here, recovery from the DDF changes the array state to OPns. In the DLns state, if one of the failed disks is successfully replaced by a new spare disk, the array switches to the DL state.

In the DUns1 state, the array is totally unavailable due to a disk failure and a human error. In this state, successful recovery of the human error changes the array state to EXPns1. However, if the wrongly removed disk crashes, the array switches to DLns. In the DUns1 state, performing the disk fail-over by using the disk array is not possible, as the user data is unavailable due to the human error. In this case, performing the disk fail-over before recovering from the human error is similar to recovering from a DDF at the rate µDDF. In the DUns1 state, if the failed disk is successfully replaced by a new spare disk, the array switches to the DU1 state.

In the DUns2 state, the array data is totally unavailable due to the occurrence of two human errors. In this case, the

Fig. 4: Comparison of Reference and Markov Models (availability, in number of nines, versus disk failure rate, for the MC simulation and the Markov model at hep = 0.01 and hep = 0.001)

Fig. 5: Availability of a RAID5(3+1) Array (availability, in number of nines, versus human error probability, for failure rate and beta values of (1.25E-6, 1.09), (2.17E-6, 1.12), (7.96E-6, 1.21), and (2.00E-5, 1.48))

array can switch to EXPns2 if one of the human errors is successfully recovered. However, if one of the wrongly removed disks crashes, the array switches to DUns1.

EXP2 is similar to EXPns2 except that the hot-spare disk is available in this state. Similarly, the DU1 and DU2 states are similar to DUns1 and DUns2, respectively, except that the hot-spare disk is available in DU1 and DU2.

Comparing the Markov model of a RAID5 array employing automatic fail-over (Fig. 3) with that of a RAID5 array using the conventional disk replacement policy (Fig. 2) shows a longer path from the OP state to the DU state when automatic fail-over is employed. Hence, the probability of being in the DU state decreases significantly under the automatic fail-over policy. Detailed numerical results of this model and a comparison with a RAID array employing the conventional replacement policy will be presented in Section V-D.

V. EXPERIMENTAL SETUP AND SIMULATION RESULTS

A. Validation of Markov Model with Simulation-Based Model

Fig. 4 shows a comparison of the proposed MC simulation results (for 10^6 iterations and a 99% confidence level) and the availability values obtained by the Markov model. As shown in this figure, the availability values obtained by the Markov model are within the error interval of the results obtained by the MC simulations for both hep = 0.001 and hep = 0.01.

B. Availability Estimation in Presence of Human Error

Fig. 5 reports the availability results of a RAID5 3+1 array in the presence of human errors for different disk failure rates. The availability of the disk subsystem is reported for the traditional availability model (assuming hep = 0) as well as for two different human error probabilities (hep = 0.001 and hep = 0.01). We consider typical values for the repair rates in our experiments. In particular, we use the values 0.1 and 0.03 for µDF and µDDF, respectively. We also consider

Fig. 6: Comparison of Availability of RAID Configurations with Equivalent Usable Capacity and Exponential Disk Failure (availability, in number of nines, versus human error probability for RAID1(1+1), RAID5(3+1), and RAID5(7+1), at failure rates of a) 1E-5, b) 1E-6, and c) 1E-7)

µs = 1, µhe = 1, and λCrash = 0.01. Considering a constant failure rate (for example, λ = 10^-6), it is observed that the availability of the disk subsystem decreases as the human error probability increases. As the results show, with a human error probability of 0.001, the availability of the disk subsystem drops by one to two orders of magnitude.

C. Availability Comparison of RAID Configurations with Equivalent Usable Capacity

Fig. 6 compares the availability of three different RAID configurations, R5(3+1), R5(7+1), and R1(1+1), with equivalent usable (logical) capacity, in the presence of human errors (hep = 0, hep = 0.001, and hep = 0.01), assuming an exponential failure distribution (λ = 10^-5). Comparing the three RAID configurations under the assumption of no human errors (hep = 0) shows that RAID1(1+1) results in a higher availability than RAID5(3+1) and RAID5(7+1). However, with hep = 0.001, the availability of all RAID configurations decreases dramatically, and our results show a more significant decrease for the RAID1(1+1) configuration, making its availability slightly lower than that of both the RAID5(3+1) and RAID5(7+1) configurations. This can be explained by the higher Effective Replication Factor2 (ERF) of RAID1(1+1) (ERF = 2) compared to RAID5(3+1) (ERF = 1.33) and RAID5(7+1) (ERF = 1.14), which mandates employing a higher number of disks for a given usable capacity, increasing the chance of disk failures and, consequently, of human errors. For higher hep values (e.g., 0.01), we observe a wider gap between the RAID configurations, where both RAID1(1+1) and RAID5(3+1) show lower availability than RAID5(7+1), which can again be explained by the lower ERF of RAID5(7+1).
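The ERF arithmetic behind this ranking can be made concrete with a small sketch. The usable capacity below (24 disks' worth of data) is a hypothetical value of ours, used only to compare physical disk counts; rounding to whole parity groups is ignored for simplicity.

```python
def effective_replication_factor(data_disks, parity_disks):
    """ERF: ratio of physical storage size to usable (logical) size [15]."""
    return (data_disks + parity_disks) / data_disks

# Physical disks needed for the same usable capacity (24 disks' worth of data).
usable = 24
for name, d, p in [("RAID1(1+1)", 1, 1), ("RAID5(3+1)", 3, 1), ("RAID5(7+1)", 7, 1)]:
    erf = effective_replication_factor(d, p)
    print(f"{name}: ERF = {erf:.2f}, physical disks = {usable * erf:.1f}")
```

For these configurations the sketch reports ERFs of 2.00, 1.33, and 1.14, matching the values cited above: more physical disks per unit of usable capacity mean more failure events and hence more opportunities for human error.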

D. Effect of Automatic Disk Fail-over Policy

In this section, we report the effect of the automatic fail-over policy when the on-line rebuild process is performed using hot-spare disks. Fig. 7 compares the availability of two RAID5 arrays, performing conventional and automatic fail-over, in the presence of human errors. As the results show, using the automatic fail-over policy can significantly moderate the effect of human errors. For example, assuming hep = 0.01, automatic fail-over increases the system availability by two orders of magnitude compared to the conventional disk replacement policy. The results reported in Fig. 7 also demonstrate that the delayed replacement policy yields a higher availability improvement as hep grows.

2The ratio of storage physical size to the logical (usable) size [15].

Fig. 7: Availability of Automatic Fail-over Policy (availability, in number of nines, versus human error probability, for the delayed and conventional disk replacement policies)

VI. CONCLUSION AND FUTURE WORKS

In this paper, we investigated the effect of incorrect disk replacement service on the availability of a backed-up disk subsystem by using Monte Carlo simulations and Markov models. By taking the effect of incorrect disk replacement service into account, it is shown that a small probability of human error (e.g., hep = 0.001) can increase the system unavailability by more than one order of magnitude. Using the proposed models, it is also shown that in some cases the dependability ranking of RAID configurations does not match conventional expectations. Additionally, it is shown that automatic fail-over can increase the system availability by two orders of magnitude. Such observations can be used by both designers and system administrators to enhance the overall system availability.

REFERENCES

[1] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why do internet services fail, and what can be done about it?” in USENIX Symposium on Internet Technologies and Systems, vol. 67, Seattle, WA, 2003.

[2] A. Brown and D. A. Patterson, “To err is human,” in EASY, 2001.

[3] D. Oppenheimer, “The importance of understanding distributed system configuration,” in Conference on Human Factors in Computer Systems Workshop, 2003.

[4] E. Haubert, “Threats of Human Error in a High-Performance Storage System: Problem Statement and Case Study,” Computing Research Repository, vol. abs/cs/041, 2004.

[5] F. Chandler, I. A. Heard, et al., “NASA human error analysis,” Tech. Rep., 2010. [Online]. Available: www.hq.nasa.gov/office/codeq/rm/docs/hra.pdf

[6] W. Gibson, B. Hickling, and B. Kirwan, “Feasibility study into the collection of human error probability data,” EEC Note, vol. 2, 2006.

[7] U.S. Nuclear Regulatory Commission, Reactor Safety Study: An Assessment of Accident Risks in US Commercial Nuclear Power Plants, 1975, vol. 2.

[8] A. D. Swain and H. E. Guttmann, “Handbook of human-reliability analysis with emphasis on nuclear power plant applications. Final report,” Sandia National Labs., Albuquerque, NM (USA), Tech. Rep., 1983.

[9] B. Schroeder, S. Damouras, and P. Gill, “Understanding latent sector errors and how to protect against them,” Transactions on Storage (TOS), vol. 6, pp. 9:1–9:23, September 2010.

[10] J. Elerath and M. Pecht, “A highly accurate method for assessing reliability of redundant arrays of inexpensive disks (RAID),” Transactions on Computers (TC), vol. 58, no. 3, pp. 289–299, March 2009.

[11] K. M. Greenan, J. S. Plank, and J. J. Wylie, “Mean time to meaningless: MTTDL, Markov models, and storage system reliability,” HotStorage, pp. 1–5, 2010.

[12] B. Schroeder and G. A. Gibson, “Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?” FAST, pp. 1–16, 2007.

[13] A. D. Swain, “Human reliability analysis: Need, status, trends and limitations,” Reliability Engineering & System Safety, vol. 29, no. 3, pp. 301–313, 1990.

[14] K. L. Lange, R. J. Little, and J. M. Taylor, “Robust statistical modeling using the t distribution,” Journal of the American Statistical Association, vol. 84, no. 408, pp. 881–896, 1989.

[15] S. Muralidhar, W. Lloyd, S. Roy, C. Hill, E. Lin, W. Liu, S. Pan, S. Shankar, V. Sivakumar, L. Tang et al., “f4: Facebook’s warm blob storage system,” in OSDI, 2014, pp. 383–398.