Ensieea Rizwani

Disk Failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

Bianca Schroeder, Garth A. Gibson
Carnegie Mellon University

Motivation

O Storage failure can cause not only temporary data unavailability but also permanent data loss.

O Technology trends and market forces may make storage system failures occur more frequently in the future.

What is Disk Failure?

O Assumption: Disk failures follow a simple “fail-stop model”, where disks either work perfectly or fail absolutely and in an easily detectable manner.

O Disk failures are much more complex in reality.

O Why?

Complexities of Disk failure

O It is often hard to correctly attribute the root cause of a problem to a particular hardware component.

O Example:
   O If a drive experiences latent sector faults or a transient performance problem, it is often hard to pinpoint the root cause.

Complexities of Disk failure

O There is no unique definition of when a drive is faulty.

O Example:
   O Customers and vendors might use different definitions.
   O A common way for a customer to test a drive is to read all of its sectors to see if any reads experience problems, and to declare the drive faulty if any one operation takes longer than a certain threshold.

O Many sites follow a “better safe than sorry” mentality and use even more rigorous testing.
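
The customer-side test described on this slide can be sketched as follows. This is only a minimal illustration, assuming the per-sector read latencies have already been measured; the function name `is_faulty`, the input format, the sample latencies, and the 500 ms threshold are all hypothetical and do not come from the paper.

```python
def is_faulty(read_latencies_ms, threshold_ms):
    """Flag a drive as faulty if any single read fails or exceeds the threshold.

    read_latencies_ms: latency of each sector read in milliseconds, or None
    for a read that returned an error (hypothetical input format).
    """
    return any(lat is None or lat > threshold_ms for lat in read_latencies_ms)


# Hypothetical measurements: one slow read is enough to fail the drive.
healthy = [4.1, 3.9, 5.2, 4.4]
suspect = [4.1, 3.9, 812.0, 4.4]
print(is_faulty(healthy, threshold_ms=500.0))  # False
print(is_faulty(suspect, threshold_ms=500.0))  # True
```

Because a single slow operation triggers replacement, a stricter threshold directly raises the number of drives a site counts as failed.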

Challenge!

Unfortunately, many aspects of disk failures in real systems are not well understood, probably because the owners of such systems are reluctant to release failure data or do not gather such data.

Research data for this paper

The authors sent requests to a number of large production sites and were able to convince several of them to provide failure data from some of their systems.

O The article provides an analysis of seven data sets collected at:
   O high-performance computing sites
   O large internet services sites

Data Sets

O Consist primarily of hardware replacement logs

O Vary in duration from one month to five years

O Cover a population of more than 100,000 drives from at least four different vendors

O Include drives with SCSI, FC and SATA interfaces

Research data sets

o Research is based on hardware replacement records and logs.

o The article analyzes records from a number of large production systems. These logs contain a record (with the associated conditions) for every disk that was replaced in the system during the data collection period.

Data set Analysis

After a disk drive is identified as the likely culprit in a problem, the operations staff perform a series of tests on the drive to assess its behavior. If the behavior qualifies as faulty according to the customer’s definition, the disk is replaced and a corresponding entry is made in the hardware replacement log.
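
As a rough sketch of how a replacement log of this kind can be turned into an annual replacement rate, the snippet below divides the number of logged replacements by the drive-years observed. The log entries, drive count, and observation period are made-up values, and the formula (replacements per drive-year, with a constant drive population) is an assumption for illustration rather than a quote from the paper.

```python
from datetime import date


def annual_replacement_rate(replacement_dates, num_drives, start, end):
    """Replacements per drive-year over [start, end], as a fraction.

    Assumes a constant population of num_drives for the whole period.
    """
    years = (end - start).days / 365.0
    in_window = [d for d in replacement_dates if start <= d <= end]
    return len(in_window) / (num_drives * years)


# Hypothetical replacement log: one entry per replaced disk.
log = [date(2005, 2, 3), date(2005, 7, 19), date(2005, 11, 30), date(2006, 4, 8)]
arr = annual_replacement_rate(log, num_drives=100,
                              start=date(2005, 1, 1), end=date(2007, 1, 1))
print(f"ARR = {arr:.1%}")  # ARR = 2.0%
```

Note that this counts replacements, not confirmed failures: a drive enters the log whenever it meets the customer's own definition of faulty.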

Factors That Failure Depends On

O Operating conditions
   O Environmental factors
      O Temperature
      O Humidity
   O Data handling procedures
      O Workloads
      O Duty cycles or powered-on hours patterns

Effect of Bad Batch

O Failure behavior of disk drives may differ even if they are of the same model.

O Changes in the manufacturing process and parts could play a huge role:
   O A drive’s hardware or firmware components
   O The assembly line on which a drive was manufactured

O A bad batch can lead to unusually high drive failure rates or high media error rates.

Bad Batch Example

O HPC3, one of the data set customers, had 11,000 SATA drives replaced in Oct. 2006.

O The drives showed a high frequency of media errors during writes.

O It took a year to resolve.

O The customer and vendor agreed that these drives did not meet warranty conditions.

O The cause was attributed to the breakdown of a lubricant during manufacturing, leading to unacceptably high head flying heights.

Bad Batch

O The effect of batches is not analyzed.

O The research reports on the field experience, in terms of disk replacement rates, of a set of drive customers.

O Customers usually do not have the information necessary to determine which of the drives they are using come from the same or different batches.

O The data spans a large number of drives (more than 100,000) and comes from a diverse set of customers and systems.

Research Failure Data Sets

Reliability Metrics

O Annualized failure rate (AFR): the failure rate scaled to a yearly estimate

O Mean time to failure (MTTF)

O Annual replacement rate (ARR)

O MTTF = power-on hours per year / AFR

O Fact: MTTFs specified for today’s highest-quality disks range from 1,000,000 hours to 1,500,000 hours.
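
The relation between MTTF and AFR on this slide can be worked through numerically. The sketch below assumes a drive that is powered on all year (8,760 hours); that duty cycle and the function names are assumptions for illustration, not values stated on the slide.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming the drive is always on


def afr_from_mttf(mttf_hours, power_on_hours_per_year=HOURS_PER_YEAR):
    """Annualized failure rate implied by a datasheet MTTF (as a fraction)."""
    return power_on_hours_per_year / mttf_hours


def mttf_from_afr(afr, power_on_hours_per_year=HOURS_PER_YEAR):
    """Inverse relation: MTTF = power-on hours per year / AFR."""
    return power_on_hours_per_year / afr


for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> AFR {afr_from_mttf(mttf):.2%}")
# MTTF 1,000,000 h -> AFR 0.88%
# MTTF 1,500,000 h -> AFR 0.58%
```

Read this way, an MTTF of 1,000,000 hours is a statement about a population: under these assumptions, fewer than 1% of drives would be expected to fail per year. It does not mean that an individual drive will run for 1,000,000 hours.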

Hazard rate h(t)

O Hazard rate of the time between disk replacements

O h(t) = f(t) / (1 - F(t))

O t denotes the time between failures

O h(t) describes the instantaneous failure rate since the most recently observed failure

O F(t) is the cumulative distribution function and f(t) is the corresponding probability density function

Hazard rate Analysis

O A constant hazard rate implies that the probability of failure at a given point in time does not depend on how long it has been since the most recent failure.

O An increasing hazard rate means that the probability of a failure increases if the time since the last failure has been long.

O A decreasing hazard rate means that the probability of a failure decreases if the time since the last failure has been long.
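
The three cases above can be made concrete with the Weibull family, whose hazard rate is decreasing, constant, or increasing depending on its shape parameter (shape 1 is the exponential distribution). The sketch below evaluates h(t) = f(t) / (1 - F(t)) directly; the shape values 0.7 and 1.5, the unit scale, and the sample times are arbitrary choices for illustration, not parameters fitted to the paper's data.

```python
import math


def weibull_hazard(t, shape, scale=1.0):
    """h(t) = f(t) / (1 - F(t)) for a Weibull distribution."""
    x = t / scale
    pdf = (shape / scale) * x ** (shape - 1) * math.exp(-(x ** shape))
    cdf = 1.0 - math.exp(-(x ** shape))
    return pdf / (1.0 - cdf)


times = [0.5, 1.0, 2.0]
for shape, label in [(0.7, "decreasing"), (1.0, "constant"), (1.5, "increasing")]:
    rates = [round(weibull_hazard(t, shape), 3) for t in times]
    print(f"shape={shape}: {label} hazard {rates}")
```

A shape below 1 means that a long quiet period makes the next failure less likely, while a shape above 1 means the opposite.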

Comparing Disk replacement frequency with that of other hardware components

O The reliability of a system depends on all its components, not just the hard drives.

O How does the frequency of hard drive failures compare to that of other hardware failures?

Hardware Failure Comparison

While the above table suggests that disks are among the most commonly replaced hardware components, it does not necessarily imply that disks are less reliable or have a shorter lifespan than other hardware components.

Responsible Hardware

Node outages that were attributed to hardware problems broken down by the responsible hardware component. This includes all outages, not only those that required replacement of a hardware component.

It is interesting to observe that for these data sets there is no significant discrepancy between replacement rates for SCSI and FC drives, commonly represented as the most reliable types of disk drives, and SATA drives, frequently described as lower quality.

Note: HPC4 uses exclusively SATA drives.

Why is the observed field ARR so much higher than the datasheet MTTF suggests?

O Field AFRs are more than a factor of two higher than datasheet AFRs.

O Customer and vendor definitions of “faulty” vary.

O MTTFs are determined based on accelerated stress tests, which make certain assumptions about the operating conditions.

Age dependent replacement rate

Failure rates of hardware products typically follow a “bathtub curve” with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle.

MTTF & AFR

O ARR is larger than the datasheet MTTF suggests in all years except the first year.

O The increasing replacement rate suggests early wear-out.

O This disagrees with the “bottom of the bathtub” analogy.

Distribution of Time

Distribution of time between disk replacements across all nodes in HPC1.

Many have pointed out the need for a better understanding of what disk failures look like in the field. Yet hardly any published work exists that provides a large-scale study of disk failures in production systems.

Conclusion

O Field usage appears to differ from datasheet MTTF conditions.

O For drives less than five years old, ARR was larger than what the datasheet MTTF suggested by a factor of 2–10. Drives in this age range are often expected to be in steady state (bottom of the “bathtub curve”).

O In the data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component-specific factors.
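
To get a feel for what a factor of 2–10 means, the arithmetic can be run in reverse: scale the datasheet AFR by the factor and convert back to an MTTF. The sketch below assumes a 1,000,000-hour datasheet MTTF and drives powered on all year (8,760 hours); the function name is hypothetical and the results are illustrative arithmetic, not figures reported in the paper.

```python
HOURS_PER_YEAR = 24 * 365  # assumes drives are powered on all year


def implied_mttf_hours(datasheet_mttf_hours, factor):
    """MTTF that would correspond to a field rate `factor` times the datasheet AFR."""
    datasheet_afr = HOURS_PER_YEAR / datasheet_mttf_hours
    field_rate = factor * datasheet_afr
    return HOURS_PER_YEAR / field_rate


for factor in (2, 10):
    field_rate = factor * HOURS_PER_YEAR / 1_000_000
    mttf = implied_mttf_hours(1_000_000, factor)
    print(f"{factor}x: field rate {field_rate:.2%}, implied MTTF {mttf:,.0f} h")
# 2x: field rate 1.75%, implied MTTF 500,000 h
# 10x: field rate 8.76%, implied MTTF 100,000 h
```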

Conclusion

O A common concern is that MTTF underrepresents infant mortality, but the early onset of wear-out is more important than the underrepresentation of infant mortality.

O The empirical distributions of time between disk replacements are best fit by Weibull and gamma distributions, not by exponential distributions.
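
One simple way to see why an exponential model may fit poorly is the squared coefficient of variation (variance divided by the squared mean), which equals exactly 1 for an exponential distribution. The sketch below is not the fitting procedure behind the slide's conclusion; it is a basic diagnostic assumed here for illustration, and the sample gaps and the Weibull shape of 0.7 are invented values.

```python
import math
import statistics


def squared_cv(samples):
    """Squared coefficient of variation: variance / mean^2."""
    mean = statistics.fmean(samples)
    return statistics.pvariance(samples) / mean ** 2


def weibull_squared_cv(shape):
    """Theoretical C^2 of a Weibull distribution (independent of its scale)."""
    g1 = math.gamma(1 + 1 / shape)
    g2 = math.gamma(1 + 2 / shape)
    return g2 / g1 ** 2 - 1


# Hypothetical times between replacements (days), for illustration only.
gaps = [1, 2, 2, 3, 5, 8, 20, 45, 90, 200]
print(f"empirical C^2 = {squared_cv(gaps):.2f}")
print(f"Weibull shape 1.0 (exponential): C^2 = {weibull_squared_cv(1.0):.2f}")
print(f"Weibull shape 0.7: C^2 = {weibull_squared_cv(0.7):.2f}")
```

An empirical value well above 1 indicates more variability than an exponential model allows, which is the kind of behavior a Weibull distribution with shape below 1 can capture.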

Thank You !

Citation: Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
Bianca Schroeder
http://www.cs.cmu.edu/~bianca/fast07.pdf