35
Predicting Storage Failures by Ahmed El-Shimi [email protected] VAULT - Linux Storage and File Systems Conference Cambridge MA. March 22, 2017 Copyright Minima Inc. 2017

Predicting Storage Failures Rebuild is not free 2012 2015 2017 Capacity 4TB HDD 8TB HDD 16TB HDD Throughput 100 MB / Sec 150 MB / Sec 200 MB/ Sec 1-DiskRebuild Time 11 hours 15 hours

  • Upload
    others

  • View
    13

  • Download
    1

Embed Size (px)

Citation preview

PredictingStorageFailuresbyAhmedEl-Shimi

[email protected]

VAULT- LinuxStorageandFileSystemsConferenceCambridgeMA.March22,2017

CopyrightMinimaInc.2017

ThisTalk

• PartI:Motivation• DriveFailure&CommonMitigations• SeekingaBetterRecover/Rebuild• UseCases&Goals

• PartII:ExaminingtheData• Dataset• Features,Trends,Challenges

• PartIII:PredictingDriveFailure• HowtoEvaluate• Baseline• Approach&Models• Evaluation&Results

CopyrightMinimaInc.2017

PartIDisksfail.Andevenwithredundancyfailurehascosts.

CopyrightMinimaInc.2017

TheProblem

Source:HardDriveFailureRateReportQ32016,Backblaze.

2%DriveFailureRate

Averaged,Annualized.

DriveAnnualFailureRatebyManufacturerandModel

1.2%0.7% 0.6% 0.4% 0.4% 0.3% 0.0%

10.2%

3.2%

1.5%0.6%

11.3%

2.2%

0.0%

1 2 3 4 5 6 7 1 2 3 4 1 2 3

HGST Seagate WDC

CopyrightMinimaInc.2017

Today’sMitigation

Assume: “Everythingthatcanfailwillfail”

Design: Redundancyateverypointoffailure

RAID ERASURECODING REPLICATION

CopyrightMinimaInc.2017

ButRebuildisnotfree

2012 2015 2017

Capacity 4TBHDD 8TBHDD 16TBHDD

Throughput 100MB/Sec 150MB/Sec 200MB /Sec

1-Disk RebuildTime 11hours 15hours 22hours

0

200

400

600

800

1000

1200

1400

2002(80GBHDD) 2012(4TBHDD) 2015(8TBHDD) 2017(16TBHDD)

SingleHDDRebuildTime(Minutes)

RebuildInflationhasconsequences:• AvailabilityandDurability9s• Rebuildisaworkload!• Resilience,Reliability• DiskCapacity&NetworkManagement• FailureModes• Lotsofcomplexitytoaddressedgecases

Canwedobetterifwehadanearlywarning?

CopyrightMinimaInc.2017

UseCases&Goals

Cloud

• ProactiveRebuild

• SmarterOpsScheduling

Enterprise/Field

• ProactiveRebuild

• BetterFRUSLA

End-UserPC

• BackupNow• Contingencyplanning

CopyrightMinimaInc.2017

PartIIExaminingtheData.

CopyrightMinimaInc.2017

TheBackblaze Dataset

• Backblaze.com:OnlineBackupandCloudStorageprovider.• 83+Kdrives• 2013-2016• Seagate,Hitachi,HGST,WesternDigital,Toshiba,Samsung

https://www.backblaze.com/b2/hard-drive-test-data.html• Hatsofftothemforsharingtheirdataopenly.

CopyrightMinimaInc.2017

UnderstandingtheData

Dataset:• 83Kdrives• 46MDriveDays(2013-2016)• >5000failures• Dailysnapshotforeachdrive’shealth

state+SMARTmetrics• SMART(Self-Monitoring,Analysisand

ReportingTechnology)astandardmonitoringsystemincludedinHDDsandSSDs

ExampleSMARTAttributes:• SMART_1:ReadErrorRate• SMART_5:ReallocatedSectorsCount• SMART_9:PowerOnHours• SMART_7:SeekErrorRate• SMART_197:CurrentPendingSector

Count

https://en.wikipedia.org/wiki/S.M.A.R.T.

CopyrightMinimaInc.2017

SMART5:ReallocatedSectorCount

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART1:ReadErrorRate

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART9:PowerOnHours

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

SMART197:CurrentPendingSectorCountover30days

Sampleof7678drives(50%failed/50%healthy)

CopyrightMinimaInc.2017

PartIIIDiskfailurescanbepredicted.Notperfectly.Butbetterthansimpleheuristics.

CopyrightMinimaInc.2017

PerformanceMetric

• Goals:• Detectrareevent(1/20orless)• TunedependingonUseCase

• ToleranceforFalsePositives• vs.ToleranceforFalseNegatives

• Wewanttomaximizeourabilitytomakebettertradeoffs

• PerformanceMetric:• PR.AUC:AreaUnderPrecision-Recallcurve

PR.AUC

Precision

RecallCopyrightMinimaInc.2017

BaselineHeuristic

• IfanyofthecriticalSMARTattributes>0thenthedriveislikelytofail

• SMART_5:ReallocatedSectorsCount• SMART_187:ReportedUncorrectable• SMART_188:CommandTimeout• SMART_197:CurrentPendingSectorCount• SMART_198:OfflineUncorrectable

CopyrightMinimaInc.2017

BaselinePerformance

• Evaluationdataset:• 13980drives• 699failed• 13281healthy

• Precision:42%• (i.e.58%falsepositives)

• Recall:68%• (i.e.32%falsenegatives)

ConfusionMatrix

BaselinePrediction

Healthy Failed

TruthHealthy 12625 656

Failed 223 476

CopyrightMinimaInc.2017

Approach

• Splitthedataintotrain/testandEval• 2013-2015:Train/Test• 2016:Eval

• Samplefromthetraindataat50/50• (Learnequallyfromfailure/health)

• SamplefromtheEval dataat95/05• (Evaluateatafixedreal-lifefailure/healthymix)

CopyrightMinimaInc.2017

Featuremanagementchallenges

• InconsistencyofSMARTdatasupportacrossvendorsanddrivemodels• DataSparseness• OpacityofmostSMARTmetrics• FurtheropacityofNormalizedSMARTvalues• WideRangeformostSMARTvalues• Datasetskewnessbyvendorandmodel• Somegapsandinconsistenciesinthetelemetrydata

CopyrightMinimaInc.2017

FeatureSelection&Models

• FeatureSelection:• RawoverNormalizedSMARTData• Createdmodel_fam featuretocollapsevendor/model• Z-ScoreNormalizationofRAWvalues• 3-day,5-dayrollingVarianceforallSMARTRAWvalues• andafewthingswhichdidn’tworkaswellasinitiallyhopedJ

• Models:• RandomForests• LogisticRegression• SupportVectorMachines

• Goal:• Improveonthebaseline• ExpandoptionstotuneDecisionThresholdforvariousPrecisionvs.Recalltradeoffs(basedonUseCase)

CopyrightMinimaInc.2017

Model1:RandomForests

SingleDecisionTree RandomForestCopyrightMinimaInc.2017

Model2:LogisticRegression

• Sigmoid(Step)functiontomodelabinomialoutput(0/1)

• Workswellforlinearcontinuousnumericalinputs• Corollary:Horriblyforcategoricalnon-linearvariablesappearingtobecontinuousnumerical

• Inourcasethekeyisto:• IsolaterightSMARTmetrics• Normalizewhereneeded

CopyrightMinimaInc.2017

Model3:SupportVectorMachines

• BinaryClassifierbasedonmappingnon-lineardataintoahigherdimensiontomakelinearseparationpossible

• Maximizesmarginofseparationbetween+/-categories

• InourcasethekeywasstilltoisolaterightSMARTmetricsbeforetraining

ByAlisneaky,svg versionbyUser:Zirguezi - Ownwork,CCBY-SA4.0,https://commons.wikimedia.org/w/index.php?curid=47868867

CopyrightMinimaInc.2017

Results:Seagatedrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-1

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-2

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-4

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Seagatedrives.Day-8

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Hitachidrives.DayZero

CopyrightMinimaInc.2017

RandomModelAUC=0.05

Results:Hitachidrives.Day-1

CopyrightMinimaInc.2017

RandomModelAUC=0.05

FeatureImportance– SMARTRaw– AllModels

0

500

1000

1500

2000

2500

3000

3500

4000

Importance(RandomForestsmodeloverSMARTRawfeatures)

CopyrightMinimaInc.2017

FeatureImportance– SMARTVariance- Seagate

0

50

100

150

200

250

300

350

400

Importance(RandomForestsmodeloverSMARTRollingVariancefeatures)

CopyrightMinimaInc.2017

Summary

• 2%ofDisksfailannually,onaverage.Mileagevariesbymodel.• SMARTmetricscanclearlysignalfailures,sometimesdaysbeforeithappens• Wecantrain/predictacrosssomeofthedrivemodelsovercomingtrainingdatasparsity• WecanreasonablytrainmodelstopredictdrivefailureusingSMARTdataandimproveuponexistingheuristics• PickingandtuningtherightmodeldependsonUseCaseandgoals

• Precisionvs.Recalltradeoff• Differentmodelsgiveyoumoreoptions

• MoreDataisbetter.Duh!

CopyrightMinimaInc.2017