Disk Failures Eli Alshan. Agenda Articles survey – Failure Trends in a Large Disk Drive Population – Article review – Conclusions – Criticism – Disk failure

Disk Failures

Eli Alshan

Agenda

• Articles survey

– Failure Trends in a Large Disk Drive Population
  – Article review
  – Conclusions
  – Criticism

– Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
  – Article review
  – Conclusions
  – Criticism

• Further research suggestion

Definitions

• Disk failure - a drive is considered to have failed if it was replaced as part of a repair procedure
– 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers
• MTTF - Mean Time To Failure
• AFR - Annualized Failure Rate
• ARR - Annual Replacement Rate
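The link between MTTF and AFR can be made concrete with a small sketch. It assumes the usual datasheet convention that a drive is powered on for all 8,760 hours of a year, and, for the second function, a constant failure rate; the function names are illustrative and do not come from either article.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, assuming continuous operation


def afr_from_mttf(mttf_hours: float) -> float:
    """Nominal annual failure rate (as a fraction) implied by an MTTF.

    Linear approximation: expected failures per drive-year.
    """
    return HOURS_PER_YEAR / mttf_hours


def afr_constant_rate(mttf_hours: float) -> float:
    """Probability that a drive fails within one year, assuming a constant
    failure rate (exponentially distributed lifetime)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


nominal = afr_from_mttf(1_000_000)      # about 0.88% per year
exact = afr_constant_rate(1_000_000)    # slightly lower than the linear value
print(f"nominal AFR: {nominal:.2%}, constant-rate AFR: {exact:.2%}")
```

For an MTTF of 1,000,000 hours both forms give an AFR just under 0.9%, which is the kind of figure quoted in vendor datasheets.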

Failure Trends in a Large Disk Drive Population

• Analysis of drive self-monitoring data, collected from a large disk drive population

• Attempt to isolate parameters highly correlated with disk failures

Results - Utilization

• Very young and very old age groups appear to show the expected behavior

• Possible explanation - infant mortality

Results - Temperature

• Lower temperatures are associated with higher failure rates

Results – SMART

• Scan Errors - background surface scan errors
• Reallocation Count - count of sector data reallocations triggered by recurring errors caused by the sector
• Offline Reallocations - reallocation counts in which only reallocated sectors found during background scrubbing are counted
• Probational Counts - sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems

Results – SMART

Results – SMART

• Scan errors affect the survival probability of young drives dramatically, but after the first month the curve flattens out
• Older drives decline steadily in survival probability throughout the 8-month period
• This behavior could be another manifestation of the infant mortality phenomenon

Results – SMART

Conclusions

• No consistent pattern of higher failure rates was found for higher-temperature drives or for drives at higher utilization levels
• A few SMART parameters are well correlated with higher failure probabilities
• Of all failed drives, over 36% showed no count in any of the SMART signals and no temperature or utilization indication before failure

Criticism

• Analyzing complex, correlated input data one parameter at a time might be misleading
• Temperature and utilization should be time-windowed, so that readings closer to the failure receive more weight
• Physical proximity between the tested drives must be taken into account, since nearby drives experience similar environmental conditions
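One possible reading of the time-windowing suggestion is a recency-weighted average, in which each reading counts half as much for every fixed number of samples it lies before the end of the window. The sketch below is only an illustration of that idea; the function name, the half-life parameter and the sample temperatures are all hypothetical and do not appear in the article.

```python
def recency_weighted_mean(readings, half_life):
    """Average of `readings` (ordered oldest to newest) in which a reading's
    weight halves for every `half_life` samples it lies before the newest one."""
    if not readings or half_life <= 0:
        raise ValueError("need at least one reading and a positive half_life")
    n = len(readings)
    # The newest reading gets weight 1.0; older readings decay geometrically.
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return sum(w * r for w, r in zip(weights, readings)) / sum(weights)


# Hypothetical temperatures (deg C): stable at first, hotter just before failure.
temps = [30, 30, 30, 30, 45, 45]
plain = sum(temps) / len(temps)
weighted = recency_weighted_mean(temps, half_life=2)
print(f"plain mean: {plain:.1f}, recency-weighted mean: {weighted:.1f}")
```

With these readings the weighted mean sits above the plain mean, because the late rise in temperature dominates the window.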

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

• An analysis of seven data sets, with a focus on storage-related failures
– Disk replacement rates observed in the field, compared with common predictors and models used by vendors
– Statistical properties of disk replacement rates

Disk Replacement Rates

• The measured average ARR was 3.4 times larger than the 0.88% given in the datasheet
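As a rough worked example of what this gap means, the observed ARR can be converted back into an "effective" MTTF. The sketch assumes the same 8,760 power-on hours per year used in datasheets and treats every replacement as a failure; the variable names are illustrative.

```python
HOURS_PER_YEAR = 8760          # assumes drives are powered on all year

datasheet_afr = 0.0088         # 0.88%, as quoted on the slide
observed_factor = 3.4          # measured ARR relative to the datasheet AFR

observed_arr = datasheet_afr * observed_factor
effective_mttf_hours = HOURS_PER_YEAR / observed_arr

print(f"observed ARR: {observed_arr:.2%}")
print(f"effective MTTF: {effective_mttf_hours:,.0f} hours")
```

Under these assumptions the observed ARR is about 3%, which corresponds to an effective MTTF of a little under 300,000 hours rather than 1,000,000.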

Disk Replacement Rates

• Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation, but steadily increase over time

Statistical properties of disk failures

• The hypothesis that the time between disk replacements follows an exponential distribution can be rejected with high confidence

• The distribution of time between disk replacements exhibits decreasing hazard rates. Disk replacements are fit best with gamma and Weibull distributions.
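To illustrate what a decreasing hazard rate looks like, the sketch below evaluates the Weibull hazard function h(t) = (k/λ)(t/λ)^(k−1). A shape parameter k below 1 gives a hazard that falls over time, while k = 1 reduces to the exponential distribution with a constant hazard. The shape and scale values used here are illustrative assumptions, not figures taken from the slides.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


times = [10.0, 100.0, 1000.0]   # time since the last replacement (arbitrary units)

# shape < 1: the longer since the last replacement, the lower the hazard.
decreasing = [weibull_hazard(t, shape=0.7, scale=100.0) for t in times]

# shape == 1: the exponential case, hazard does not depend on t.
constant = [weibull_hazard(t, shape=1.0, scale=100.0) for t in times]

print("shape 0.7:", decreasing)
print("shape 1.0:", constant)
```

A decreasing hazard means that a long gap since the last replacement makes another replacement in the near future less likely, which is the opposite of what a memoryless exponential model predicts.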

Statistical properties of disk failures

• The statistical analysis presents strong evidence for the existence of correlations between disk replacement intervals. In particular, the empirical data exhibits significant levels of autocorrelation and long-range dependence.
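Autocorrelation can be checked with a few lines of code. The sketch below computes the sample autocorrelation of a series at a given lag; the two example series are hypothetical replacement counts per period, chosen only to show a positively and a negatively correlated case, and are not data from the article.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# Hypothetical replacements per period: quiet stretch followed by a busy stretch.
clustered = [0, 0, 0, 0, 6, 6, 6, 6]
# Hypothetical series that alternates between quiet and busy periods.
alternating = [0, 6, 0, 6, 0, 6, 0, 6]

print(autocorrelation(clustered, 1))    # positive: busy periods follow busy periods
print(autocorrelation(alternating, 1))  # negative: busy periods follow quiet ones
```

For independent replacements the autocorrelation at every lag would be close to zero, so values that stay clearly positive across lags point to the kind of correlation described above.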

Conclusions

• The article demonstrates the unreliability of the MTTF and AFR data provided by disk vendors.
• Based on the data analysis, the paper's authors find a significant correlation between disk failure intervals.
• The paper shows with high statistical confidence that the commonly made assumption of exponentially distributed time between failures is not realistic.
• The article identifies higher levels of variability and decreasing hazard rates as the key features that distinguish the empirical distribution of time between disk replacements from the exponential distribution.

Criticism

• The data set is relatively small, which might undermine the thorough statistical analysis performed on it
• The statistical model suggested in the article seems too simplistic to describe a system as complex as a disk in a drive population

Further research suggestion

• State machine disk health model (HMM)
• State estimation:
– Vector of drive health indicators
– Current state of the drives physically close to the drive
• Parameters estimation:
– BIC + EM (Baum-Welch Algorithm)

(HMM diagram: states S_1, S_2, …, S_{N−1}, S_N, with observations O_1, O_2, …, O_{N−1}, O_N)

Transition probability: P(S_t | S_{t−1})

Emission probability: P(O_t | S_t)
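A minimal sketch of the state-estimation half of this proposal is the normalized forward algorithm, which updates the belief over hidden health states after each observation. Everything concrete here is an assumption made for illustration: the two states (healthy, degraded), the single binary observation (scan errors seen or not) and all probabilities. The slide's richer inputs, such as a vector of health indicators and the state of neighboring drives, are not modeled, and parameter estimation with Baum-Welch is left out.

```python
def forward_filter(observations, initial, transition, emission):
    """Return P(state at t | observations up to t) for every step t.

    initial[i]       : P(S_1 = i)
    transition[i][j] : P(S_t = j | S_{t-1} = i)
    emission[i][o]   : P(O_t = o | S_t = i)
    """
    n_states = len(initial)
    prior = list(initial)
    beliefs = []
    for obs in observations:
        # Update: weight the prior belief by how well each state explains obs.
        unnormalized = [prior[s] * emission[s][obs] for s in range(n_states)]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
        beliefs.append(posterior)
        # Predict: push the belief one step through the transition model.
        prior = [
            sum(posterior[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return beliefs


# Hypothetical model. States: 0 = healthy, 1 = degraded.
initial = [0.95, 0.05]
transition = [[0.95, 0.05],
              [0.10, 0.90]]
# Observations: 0 = no scan errors reported, 1 = scan errors reported.
emission = [[0.98, 0.02],
            [0.60, 0.40]]

beliefs = forward_filter([0, 0, 1, 1], initial, transition, emission)
for step, belief in enumerate(beliefs, start=1):
    print(f"step {step}: P(degraded) = {belief[1]:.3f}")
```

In this toy model the probability of the degraded state rises sharply once scan errors start to appear, which is the kind of per-drive health signal the proposal aims for.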
