Disk Failures
Eli Alshan
Agenda
• Articles survey
  – Failure Trends in a Large Disk Drive Population
    – Article review
    – Conclusions
    – Criticism
  – Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
    – Article review
    – Conclusions
    – Criticism
• Further research suggestion
Definitions
• Disk failure - a drive is considered to have failed if it was replaced as part of a repair procedure
  – 15-60% of drives considered to have failed at the user site are found to have no defect by the manufacturers
• MTTF - Mean Time To Failure
• AFR - Annualized Failure Rate
• ARR - Annual Replacement Rate
Failure Trends in a Large Disk Drive Population
• Analysis of drive self-monitoring data collected from a large disk drive population
• Attempt to isolate parameters highly correlated with disk failures
Results - Utilization
• Only the very young and very old age groups appear to show the expected behavior (higher utilization associated with higher failure rates)
• Possible explanation - infant mortality
Results - Temperature
• Lower temperatures are associated with higher failure rates
Results – SMART
• Scan errors - errors found by background surface scans
• Reallocation count - count of sector reallocations triggered by recurring errors in a sector
• Offline reallocations - the subset of reallocation counts that includes only reallocated sectors found during background scrubbing
• Probational counts - sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems
Results – SMART
• Scan errors affect the survival probability of young drives dramatically, but after the first month the curve flattens out
• Older drives decline steadily in survival probability throughout the 8-month period
• This behavior could be another manifestation of the infant mortality phenomenon
Conclusions
• No consistent pattern of higher failure rates was found for drives at higher temperatures or at higher utilization levels
• Only a few SMART parameters are well correlated with higher failure probabilities
• Of all failed drives, over 36% showed no count in any of the SMART signals and no temperature or utilization indication before failure
Criticism
• Analyzing complex, correlated input data one parameter at a time might be misleading
• Temperature and utilization should be time-windowed, so that readings closer to the failure receive more weight
• Physical proximity between the tested drives must be taken into account, since nearby drives experience similar environmental conditions
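One possible reading of the time-windowing suggestion is a recency-weighted average, in which a reading loses half its weight for every fixed number of days before the failure. The sketch below is an illustration only: the function name, the half-life parameter, and the sample readings are hypothetical and do not come from either paper.

```python
def recency_weighted_mean(readings, half_life_days):
    """Average sensor readings, giving more weight to those close to the failure.

    readings: list of (days_before_failure, value) pairs (hypothetical layout).
    half_life_days: age at which a reading counts half as much as one taken
    at the moment of failure (an assumed tuning parameter).
    """
    if not readings:
        raise ValueError("readings must not be empty")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    weights = [0.5 ** (age / half_life_days) for age, _ in readings]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, readings))
    return weighted_sum / sum(weights)


# Hypothetical temperature log: 30 C a month before failure, 50 C at failure.
temperatures = [(30, 30.0), (0, 50.0)]
weighted = recency_weighted_mean(temperatures, half_life_days=30)
plain = sum(value for _, value in temperatures) / len(temperatures)
print(weighted, plain)
```

With these sample values the weighted mean is pulled toward the reading taken at failure, which is the effect the criticism asks for.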
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
• An analysis of seven data sets, with a focus on storage-related failures
  – Disk replacement rates observed in the field, compared with the common predictors and models used by vendors
  – Statistical properties of disk replacement rates
Disk Replacement Rates
• The measured average ARR was 3.4 times larger than the 0.88% given in the datasheet
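The 0.88% datasheet figure appears consistent with an MTTF of 1,000,000 hours converted to a yearly rate. The sketch below shows that arithmetic under two assumptions that are not stated on the slide: a year of 8,760 powered-on hours, and (for the second function) exponentially distributed lifetimes. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760; assumes continuous operation, ignores leap years


def afr_from_mttf(mttf_hours):
    """Nominal annual failure rate as the simple ratio hours-per-year / MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def afr_exponential(mttf_hours):
    """Probability of failing within one year if lifetimes were exponential."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


datasheet_afr = afr_from_mttf(1_000_000)  # 0.00876, which rounds to 0.88%
implied_arr = 3.4 * 0.0088                # field ARR implied by the slide, about 3%
print(f"{datasheet_afr:.2%} {afr_exponential(1_000_000):.2%} {implied_arr:.2%}")
```

For an MTTF this large the two conversions are nearly identical, so the choice between them does not change the comparison with the field data.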
Disk Replacement Rates
• Contrary to common and proposed models, hard drive replacement rates do not enter a steady state after the first year of operation, but steadily increase over time
Statistical properties of disk failures
• The hypothesis that time between disks replacements follows an exponential distribution can be rejected with high confidence
• The distribution of time between disk replacements exhibits decreasing hazard rates. Disk replacements are fit best with gamma and Weibull distributions.
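To make "decreasing hazard rate" concrete, the sketch below evaluates the Weibull hazard function h(t) = (k/λ)(t/λ)^(k−1). A shape parameter k below 1 gives a hazard that falls as the time since the last replacement grows, while k = 1 reduces to the exponential distribution with a constant hazard. The shape and scale values are illustrative assumptions, not fitted values from the paper.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


hours_since_last_replacement = [10.0, 100.0, 1000.0]

# Shape < 1 (assumed value 0.7): hazard decreases with elapsed time.
decreasing = [weibull_hazard(t, shape=0.7, scale=1000.0)
              for t in hours_since_last_replacement]

# Shape == 1: the exponential special case, hazard is constant at 1 / scale.
constant = [weibull_hazard(t, shape=1.0, scale=1000.0)
            for t in hours_since_last_replacement]

print(decreasing)
print(constant)
```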
Statistical properties of disk failures
• The statistical analysis presents strong evidence of correlations between disk replacement intervals. In particular, the empirical data exhibit significant levels of autocorrelation and long-range dependence.
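Autocorrelation here means that the number of replacements in one period is correlated with the number in a later period. A minimal sketch of the sample autocorrelation at a given lag follows; the weekly counts are made-up illustrative data, and the paper's own estimation procedure may differ.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance_sum = sum((series[i] - mean) * (series[i + lag] - mean)
                         for i in range(n - lag))
    return covariance_sum / variance_sum


# Hypothetical weekly replacement counts with failures arriving in bursts.
weekly_replacements = [0, 0, 1, 5, 6, 5, 1, 0, 0, 0, 4, 6, 5, 1, 0, 0]
print(autocorrelation(weekly_replacements, lag=1))
```

A value near zero at every lag would be what independent replacements produce; clearly positive values at small lags indicate that busy weeks tend to follow busy weeks.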
Conclusions
• The article demonstrates that the MTTF and AFR data provided by disk vendors are unreliable.
• Based on the data analysis, the paper's authors find a significant correlation between disk failure intervals.
• The paper shows, with significant statistical confidence, that the commonly made assumption of exponentially distributed time between failures is not realistic.
• The article identifies higher levels of variability and decreasing hazard rates as the key features that distinguish the empirical distribution of time between disk replacements from the exponential distribution.
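One common way to quantify "higher variability than the exponential distribution" is the squared coefficient of variation (variance divided by the squared mean), which equals 1 for an exponential distribution. The sketch below computes it for a list of intervals; the sample data are hypothetical and the helper name is illustrative.

```python
import statistics


def squared_cv(intervals):
    """Squared coefficient of variation: population variance / mean squared."""
    mean = statistics.mean(intervals)
    if mean == 0:
        raise ValueError("mean interval must be non-zero")
    return statistics.pvariance(intervals) / mean ** 2


# Hypothetical times between replacements (hours): many short gaps, one long one.
bursty_intervals = [1, 1, 1, 1, 16]
print(squared_cv(bursty_intervals))
```

A result well above 1, as with this bursty sample, indicates more variability than an exponential model would predict.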
Criticism
• The data set is relatively small, which might invalidate the thorough statistical analysis performed on it
• The statistical model suggested in the article seems too simplistic to describe a system as complex as a disk drive population
Further research suggestion
• State machine disk health model (HMM)
• State estimation:
  – Vector of drive health indicators
  – Current state of the drives physically close to the drive
• Parameter estimation:
  – BIC + EM (Baum-Welch algorithm)
[Diagram: chain of hidden states S1, S2, …, SN−1, SN, with corresponding observations O1, O2, …, ON−1, ON]
Transition Probability :
Emission Probability :
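The slides do not give the formulas, so the sketch below uses the standard discrete-HMM definitions as an assumption: `trans[i][j]` is the probability of moving from state i to state j, and `emit[i][o]` is the probability of observing symbol o while in state i. It shows only the forward algorithm (the likelihood that Baum-Welch would maximize) and the BIC score that could be used to choose the number of states. Baum-Welch itself is not implemented, and the two-state "healthy/degraded" model and all of its numbers are hypothetical.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-probability of an observation sequence under a discrete HMM.

    start[i]    = P(first state is i)
    trans[i][j] = P(next state is j | current state is i)
    emit[i][o]  = P(observation is o | current state is i)
    Uses the scaled forward recursion to avoid numerical underflow.
    """
    if not obs:
        raise ValueError("obs must not be empty")
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# Hypothetical model: state 0 = healthy, state 1 = degraded;
# observation 0 = no SMART error in the period, 1 = at least one error.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
observations = [0, 0, 1, 1, 0, 1]

log_lik = forward_log_likelihood(observations, start, trans, emit)
# Free parameters for 2 states and 2 symbols: 1 (start) + 2 (trans) + 2 (emit) = 5.
print(log_lik, bic(log_lik, n_params=5, n_obs=len(observations)))
```

In a full treatment the same likelihood would be computed for models with different numbers of states after fitting each with Baum-Welch, and the model with the lowest BIC would be kept.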