This article was downloaded by: [University of California, San Francisco] on 05 December 2014, at 11:04.
Publisher: Taylor & Francis. Informa Ltd, registered in England and Wales, registered number 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.

IIE Transactions. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uiie20

A reliability modeling framework for the hard disk drive development process
Loon Ching Tang (a), Shao Wei Lam (a), Quock Y. Ng (b) and Jing Shi Goh (b)
(a) Department of Industrial and Systems Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
(b) Seagate Technology International, 63 The Fleming, Science Park Drive, Singapore 118249
Published online: 02 Feb 2010.

To cite this article: Loon Ching Tang, Shao Wei Lam, Quock Y. Ng & Jing Shi Goh (2010) A reliability modeling framework for the hard disk drive development process, IIE Transactions, 42:4, 260-272, DOI: 10.1080/07408170902906019
To link to this article: http://dx.doi.org/10.1080/07408170902906019

Taylor & Francis makes every effort to ensure the accuracy of all the information (the "Content") contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

A reliability modeling framework for the hard disk drive development process



IIE Transactions (2010) 42, 260-272. Copyright © "IIE". ISSN: 0740-817X print / 1545-8830 online. DOI: 10.1080/07408170902906019

A reliability modeling framework for the hard disk drive development process

LOON CHING TANG1,∗, SHAO WEI LAM1, QUOCK Y. NG2 and JING SHI GOH2

1Department of Industrial and Systems Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260. E-mail: [email protected]
2Seagate Technology International, 63 The Fleming, Science Park Drive, Singapore 118249

Received August 2007 and accepted March 2009

Motivated by the fact that the major causes of catastrophic failure in micro hard disk drives are mostly induced by the presence of particles, a new particle-induced failure susceptibility metric, called the Cumulative Particle Counts (CPC), is proposed for managing reliability risk in a fast-paced hard disk drive product development process. This work is thought to represent the first successful attempt to predict particle-induced failure through an accelerated testing framework which leverages existing streams of research for both particle-injection-based and inherent-particle-generation laboratory experiments to produce a practical reliability prediction framework. In particular, a new testing technique that injects particles into hard disk drives so as to increase the susceptibility to failure is introduced. The experimental results are then analyzed through a proposed framework which comprises the modeling of a CPC-to-failure distribution. The framework also requires the estimation of the growth curve for the CPC in a prime hard disk drive under normal operating conditions without particle injection. Both parametric and non-parametric inferences are presented for the estimation of the CPC growth curve. Statistical inferential procedures are developed in relation to a proposed non-linear CPC growth curve with a change-point. Finally, two applications of the framework to design selection during an actual hard disk drive development project and the subsequent assessment of reliability growth are discussed.

Keywords: Reliability improvement, accelerated testing, generalized non-linear models, change-point, monotone LOWESS, hard disk drives

1. Introduction

Assessing the reliability of products involving nano-scale structures is one of the key issues in the development of micro-electromechanical products for high-tech applications. A reliability modeling framework is proposed for a highly reliable micro-electromechanical system, namely a micro-Hard Disk Drive (HDD). A micro-HDD is less than 1 inch along the largest dimension and comprises the usual HDD components of a magnetic disk platter and an actuator arm that suspends over the platter with reading heads at its tip that slide across the disk platter to perform read–write operations. Such a framework has been successfully applied in the reliability assessment of design changes and reliability growth tracking in an actual HDD product development process.

The usefulness of this methodology stems from the key observation that maturing HDD design and production processes are accompanied by a corresponding improvement in the reliability of HDDs through the reduction in particle counts for HDDs of later vintages (Shah and Elerath, 2004). It has also been found in previous studies that more than 40% of early failures in HDDs were due to particles (Hiller, 1995). Particle contaminants are typically trapped during production or generated in HDDs due to wear between contacting interfaces. Particle-induced failures are manifested in a variety of failure modes (e.g., circumferential scratches, scratches along different head–disk interface directions, head media crashes). As the areal density of the disk surface increases, the clearance between head/slider and disk has to be reduced (Bhushan, 1996; Strom et al., 2007). This would typically result in an increase in particle–head interactions, temperature and wear particles, thereby resulting in a multitude of particle-induced failure modes (e.g., Hiller and Brown, 1994; Liu et al., 1996; Yoon and Bhushan, 2001). Thermal asperity has been reported as one particularly detrimental form of particle-induced failure mode (Li and Sharma, 2000). Such a phenomenon seriously reduces the reliability of a HDD and results in significant performance degradation (Park et al., 1999).

∗Corresponding author

In this article, we introduce a new data-driven modeling framework to characterize the susceptibility to failure in HDDs due to particle contaminants. Most of the existing papers in the literature on particle contaminants in HDDs have either investigated the failure characteristics under particle injection (e.g., Hiller and Singh, 1991; Koka and Kumaran, 1991; Bergin and Koka, 1993) or inherent particle generation characteristics (e.g., Park et al., 1999) in laboratory-based tests. However, laboratory-based tests related to particle-induced failure modes are scarce in the literature, as the experimental setup and test procedures vary according to the experimental objectives. To our knowledge, this work represents the first successful attempt at leveraging particle injection for the reliability assessment of particle-induced failure through an accelerated testing framework based on Cumulative Particle Counts (CPCs). In fact, none of the previous related research utilized CPCs as an indication of particle-induced failure susceptibility or even proposed the use of a susceptibility indicator as an alternative to time for accelerated testing of HDDs.

Based on prior understanding of the key failure mechanisms in HDDs, it is conceivable that particle-induced failures can be replicated in a laboratory setting within a much shorter time by the injection of particles (Tang et al., 2007). This offers an attractive reliability assessment alternative given the high designed-in reliability of micro-HDDs and the extremely short HDD product development process. The key idea of the proposed framework is to utilize a new measurement scale, known as CPC, as an alternative to real time (seconds, hours, months, etc.), one that can be "accelerated" to derive reliability predictions. The use of cumulative instead of incremental counts of particles also helps to smooth out variations. Considering reliability modeling in general for other industries, such an alternative scale is not unlike the use of cumulative mileage for the reliability prediction and warranty design of cars (e.g., Lawless, 1995; Eliashberg et al., 1997). In general, the use of an alternative scale for reliability prediction may also resemble the framework proposed in the maintenance literature, where the proportional hazards model is used to relate degradation characteristics (e.g., the number of particles in engine oil) to the hazard rate of a piece of equipment (Makis et al., 1998; Makis et al., 2006). However, there is a fundamental difference between the current framework and that of the above work. In the maintenance context, one deals with observational data in the form of wear debris or other degradation measures from which some reliability model is established. In contrast, here we hasten the failure process by inducing susceptibility to failure, thereby resulting in a framework similar to that of accelerated testing, from which a plausible reliability model can be established.

Our primary motivation is the development of an accelerated testing methodology that fits an extremely short product development process of 3 to 9 months (Fig. 1 shows a typical HDD product development process) and is suitable for reliability assessment across early manufacturing build phases. "Prime" HDDs, which refers to fully functional HDD builds, are available through early manufacturing builds. Reliability predictions from such drives provide reasonable approximations to the HDD reliability at mass production. The experimental setup, processes and data collection framework are described in the next section. Due to the inherent complexity of the micro-HDD under study, a data-driven modeling framework is adopted instead of the usual stochastic process modeling approach (Singpurwalla and Wilson, 1998; Duchesne and Lawless, 2000).

Fig. 1. Simplified HDD product development process.

A schematic showing the proposed framework is shown in Fig. 2. It comprises a CPC-to-Failure (CTF) model and a growth curve describing the increase of CPC across time for a HDD under normal operating conditions. The CTF model is estimated from data derived from tests with particle injection, whereas the CPC growth model is estimated from drives without particle injection. HDDs under test are simultaneously operating under a special profile that mimics extreme operational conditions. Typical profiles include operations such as read/write, load/unload and random seek. This allows design engineers to obtain relatively conservative predictions of CPC growth. The CTF distribution and CPC growth model are used to establish the Time-to-Failure (TTF) distribution for reliability assessments. "Acceleration" can be achieved in the real-time scale (the requisite test period can be shortened from months to days) as the CTF distribution can be obtained within a much shorter time using high particle injection rates.

Given the two main modeling components, the CTF distribution and the CPC growth model, this article is organized as follows. First, a brief description of the experimental process and the data analysis for establishing the CTF distribution are presented. CPC growth models for prime HDDs, based on both a parametric generalized non-linear regression framework and a non-parametric regression modeling approach, are then proposed. The main theme underlying the approach is the use of a regression model with distinct deterministic and random components. The combination of the CTF distribution and the CPC growth model results in a TTF model for particle-induced failure of HDDs. Finally, we discuss two applications of adopting CPC as an alternative to the real-time scale, namely, design selection and reliability assessment for different vintages of HDDs.
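The composition described above, evaluating the CTF distribution at the CPC level reached by time t, can be sketched numerically. The following is a minimal illustration only: the growth curve is a placeholder, not the paper's fitted model, and the lognormal parameters are those reported later for one sample dataset.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal CTF distribution (mu, sigma on the log scale)."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def ttf_cdf(t, growth, mu, sigma):
    """P(failure by time t): evaluate the CTF distribution at the CPC
    level m(t) accumulated by time t under normal (un-injected) operation."""
    return lognormal_cdf(growth(t), mu, sigma)

# Placeholder growth curve (2 particles per minute) with the lognormal
# CTF estimates reported later in the article (mu = 12.920, sigma = 1.084).
growth = lambda t: 2.0 * t
print(ttf_cdf(100_000, growth, 12.920, 1.084))
```

Because the growth curve is monotone, any CTF quantile maps to a unique TTF quantile, which is what makes CPC usable as a surrogate time scale.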


Fig. 2. Proposed reliability modeling framework.

2. CTF distribution

2.1. Experimental process

A schematic diagram of the experimental process for evaluating the CTF distribution by subjecting prime HDDs to two different phases of testing, namely, dynamic and static phases, is shown in Fig. 3 (Tang et al., 2006). During the dynamic phase, HDDs continuously spin with simultaneous particle injection. A sequence of typical HDD operations is also carried out simultaneously, together with the continuous detection of new failure codes (i.e., new defects). At the end of the dynamic phase, particle injection is stopped, and this marks the start of the static phase, in which the sequence of HDD operations is restarted without particle injection. This two-phase process was conducted in a mini-clean room environment, and failure data, comprising the TTF and CTF data, were collected throughout the experimental process. Failures were detected through error codes. Engineering failure analyses were performed to determine the failure modes for these failures. Some actual particle-induced failure data (in terms of CPC) and corresponding particle injection rates for a particular vintage of a 1-inch HDD are given in Table 1.

Fig. 3. Experimental process flow and data collection scheme.

2.2. Sample data analysis

From the data, it was observed that by subjecting the HDDs to different rates of particle injection, the time to particle-induced failure decreases. Analysis under the usual framework of an accelerated life test with a single stress variable based on the Weibull failure time distribution (Meeker and Escobar, 1998) revealed inconsistencies in the Weibull plots at different injection rates. Another difficulty encountered in adopting the usual accelerated life test framework is that under normal operating conditions, the rate of particle generation may not be constant. In view of these issues, the approach using CPC is proposed. Under this approach, the TTF distribution can be derived from the combination of the CTF model and the CPC growth model. The use of CPC as a surrogate measure for real time allows for accelerated tests. Moreover, in numerous tests conducted for different micro-HDD designs, the CTF distribution consistently followed a lognormal distributional form for a very wide range of particle injection rates. The lognormal probability plot for a sample CTF dataset (Table 1) is given in Fig. 4 with maximum likelihood estimates of the lognormal parameters. Additional distributional comparisons based on probability plotting for such an analysis can be made using the SPLIDA package (Meeker and Escobar, 1998; Meeker and Escobar, 2006). Conceptually, the lognormal distribution can be derived from a damage accumulation model (Gertsbakh and Kordonsky, 1969; Tobias and Trindade, 1998). Furthermore, the genesis of the lognormal distribution has been described from the perspective of the "proportionality" of an underlying generating mechanism (Aitchison and Brown, 1957). This is plausible as susceptibility increases as more particles flow through the head–disk interface.
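A lognormal fit to CTF data with right censoring (censored drives survive to the end of the test) can be reproduced with standard MLE machinery. The sketch below is not the SPLIDA analysis used in the article, just a minimal stand-in; the six data rows are taken from the 400 particles/min group of Table 1 for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_lognormal_censored(cpc, failed):
    """MLE of lognormal (mu, sigma) from CPC-to-failure data with right
    censoring: failed[i] is True for a failure, False for a censored
    (test-ended) observation."""
    x = np.log(np.asarray(cpc, dtype=float))
    d = np.asarray(failed, dtype=bool)

    def neg_loglik(p):
        mu, log_sigma = p
        sigma = np.exp(log_sigma)                        # keep sigma positive
        z = (x - mu) / sigma
        ll = np.sum(norm.logpdf(z[d]) - np.log(sigma))   # exact failures
        ll += np.sum(norm.logsf(z[~d]))                  # censored: log S(z)
        return -ll

    res = minimize(neg_loglik, x0=[x.mean(), 0.0], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])                    # (mu_hat, sigma_hat)

# A few rows from Table 1 (400 particles/min group), for illustration only.
cpc    = [168536, 242064, 242064, 186944, 170408, 242064]
failed = [True,   False,  False,  True,   True,   False]
print(fit_lognormal_censored(cpc, failed))
```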

Table 1. Injection rate and CPC at failure (F)/censored (C) data

Injection rate  CPC        F/C    Injection rate  CPC        F/C    Injection rate  CPC          F/C
  400           168 536    F        2000          193 084    F        9 000          217 992     F
  400           242 064    C        2000           38 248    F        9 000          272 468     F
  400           242 064    C        2000          301 920    C        9 000          517 636     F
  400           186 944    F        2000          299 797    C        9 000          217 956     F
  400           170 408    F        2000          292 380    F       12 000          253 744     F
  400           242 064    C        2000          290 220    C       12 000        1 341 996     F
 2000            44 592    F        2000          290 220    C       12 000          905 952     F
 2000           313 936    C        2000           70 888    F       12 000        1 345 544     C
 2000           266 240    F        7500          812 716    C       21 000          822 136     F
 2000           307 636    C        7500          295 780    F       21 000           63 184     F
 2000           304 564    C        7500          797 168    C       21 000          345 568     F
 2000           301 564    C        7500          797 196    F       21 000          316 224     F


                 Estimate   Standard error   95% normal CI lower   95% normal CI upper
Location, μ      12.920     0.205            12.518                13.323
Scale, σ         1.084      0.169            0.799                 1.470
∗Pearson correlation coefficient: 0.98

Fig. 4. Lognormal probability plot of CTF distribution.

3. CPC growth curve

In order to relate the CTF to the TTF distribution, a deterministic model describing the growth of CPC over real time is needed. The experiment to establish the CPC growth curve was conducted under simulated "normal" usage conditions without particle injection, using the experimental setup shown in Fig. 1, on prime drives. The postulated growth model can be established based on both empirical observations and engineering insights. Both parametric and non-parametric regression models can be used within the proposed framework.

The experimental process for establishing the CPC growth model is described in the next section. In order to establish a realistic parametric CPC growth model, we adopted a data-driven perspective with practical engineering interpretations to derive the statistical parameters. In such an approach, a preliminary Exploratory Data Analysis (EDA) procedure is first initiated to visualize the raw empirical data. Subsequently, a plausible parametric CPC growth model is established from engineering insight and EDA based on a Generalized Non-Linear Model (GNLM). The EDA also provides some initial estimates for the GNLM. A non-parametric monotone smoothing technique is also proposed as an alternative to the GNLM.

3.1. Preliminary data exploration

An EDA approach based on existing approaches for recurrent events (Nelson, 2003) is proposed here to visualize and assess the suitability of different parametric growth models for the CPC. Suppose that there are k prime drives on which the CPC were recorded. Let {t_j : j = 1, 2, ..., N} denote the ordered time sequence at which a particle detection event occurs for all k drives, and let t_N = T be the termination time of the tests. Let I_i(t) denote an indicator function with value one when a particle detection event in the i-th drive occurs and zero otherwise. The number of particle detection events that have occurred by time t in drive i is thus given by r_i(t) = Σ_{j: t ≥ t_j} I_i(t_j). Furthermore, since multiple particles may be detected in each particle detection event, let n_i(t) represent the number of particles generated by the i-th drive at time t. The sample CPC growth function for the i-th


Fig. 5. Event plot depicting the various notational definitions (for the drive shown: I_i(t) = 1, n_i(t) = 3, r_i(t) = 5 and g_i(t) = 8).

drive tested is given by g_i(t) = Σ_{j: t ≥ t_j} n_i(t_j). The relationships between I_i(t), n_i(t), r_i(t) and g_i(t) are shown in Fig. 5. The mean CPC function across all tested drives is defined as the Cumulative Particle Function (CPF). The CPF is a key function utilized in the reliability modeling framework to transform the CTF distribution to the TTF distribution. The CPF is formally defined as m(t) = (1/k) Σ_{i=1}^{k} g_i(t). The growth of the CPF and CPC are plotted over real time in Fig. 6. Some sample calculations for the r_i(t_j), n_i(t_j), g_i(t_j) and m(t) quantities will be presented in a later section (Table 3).
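The sample quantities defined above can be computed directly from per-drive event records. A small sketch with hypothetical event data (the times and per-event particle counts below are made up for illustration):

```python
def num_events(events, t):
    """r_i(t): number of particle detection events in drive i by time t."""
    return sum(1 for tj, _ in events if tj <= t)

def cpc(events, t):
    """g_i(t): cumulative particle count of drive i by time t; events is a
    list of (event_time, particles_detected_at_that_event) pairs."""
    return sum(n for tj, n in events if tj <= t)

def cpf(drives, t):
    """m(t): mean CPC across all k tested drives (the CPF)."""
    return sum(cpc(ev, t) for ev in drives) / len(drives)

# Hypothetical event records for two drives.
drive1 = [(1, 2), (5, 3), (9, 3)]   # g_1(9) = 2 + 3 + 3 = 8
drive2 = [(2, 1), (7, 4)]           # g_2(9) = 1 + 4 = 5
print(cpf([drive1, drive2], 9))     # → 6.5
```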

The CPF plot indicates that the rate of particle generation appears to reduce over time. Further examination of the CPF reveals that there are probably two phases of particle production. The initial phase could be primarily due to loose particles trapped inside drives during their manufacture. The subsequent steady-state particle generation phase has a different trend and is asymptotically linear. The asymptotic linear growth can be explained as a steady growth of CPC as particles passing through the internal volume of the drive are accumulated over time. A plausible parametric model for the CPF with engineering interpretation for all the parameters can be based on a change-point-type model which switches from a general power law form to an asymptotic linear form.

Fig. 6. CPC and CPF for prime HDDs (note: rescaled time in minutes).

3.2. Parametric model

The CPF data take the form of remeasured data. In the proposed framework, a GNLM is used to estimate the parametric form of the growth curve for the CPF data. The response here is the mean CPC at each observation time. Following the notation given in the existing literature (Lindstrom and Bates, 1990), the GNLM defining the growth curve can be written as y_j = f(t_j, β) + e_j, where y_j represents the j-th mean CPC, y_j = m(t_j); t_j is the time at which the j-th particle is observed; f is the non-linear growth curve function; β is the vector of parameters in the non-linear growth curve function; and e_j is the random error term.

There are distinct deterministic and random components in such a model. The deterministic component of the GNLM is postulated by a switching growth curve model:

    f(t) = e^{β1} t^{β2},               0 ≤ t ≤ βτ,
    f(t) = e^{β3} + e^{β4} (t − βτ),    t > βτ,        (1)

where β1 and β2 are parameters of the power-law particle growth function in the run-in phase, βτ is the change-point (it defines the end time of the initial phase), e^{β3} is the CPF level at the transition between the power-law run-in and linear steady-state phases, and e^{β4} is the particle generation rate during the steady-state phase.

Under this change-point model, the particle generation rate is modeled as non-constant and non-linear during the run-in phase and constant during the steady-state phase. The parameters have clear and straightforward engineering interpretations. There is a single change-point, βτ, in the model. e^{β3} is the CPF level when the growth function transitions from the run-in phase to the steady-state phase. The parameter β4 essentially describes the asymptotic constant rate of CPC growth. The relationship between the parameters is given by β3 = β1 + β2 ln βτ.


To ensure that the model is continuous and differentiable at the change-point, a transition function is introduced (Bacon and Watts, 1971). The transition function gives a weighting factor to the model components before and after the change-point, thus allowing the particle generation characteristics of the two phases to be manifested throughout the entire model. This is more consistent with actual engineering interpretations. The transition function can be defined as trn(t) = e^{βγ t}/(1 + e^{βγ t}), where βγ controls the speed and smoothness of the transition (see footnote 1). Using the transition function, the parametric CPF is f(t) = e^{β1} t^{β2} trn(βτ − t) + (e^{β3} + e^{β4}(t − βτ)) trn(t − βτ).
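The smoothed switching CPF can be written down directly. A sketch follows; the parameter values are the initial FIP estimates quoted later in the article, used here only to exercise the function, and the continuity relation β3 = β1 + β2 ln βτ supplies β3.

```python
import math

def trn(t, b_gamma):
    """Logistic transition function trn(t) = e^(bg*t) / (1 + e^(bg*t))."""
    return 1.0 / (1.0 + math.exp(-b_gamma * t))

def cpf_model(t, b1, b2, b3, b4, b_tau, b_gamma):
    """Smoothed switching CPF: power-law run-in blended with the
    asymptotically linear steady state through the transition function."""
    run_in = math.exp(b1) * t ** b2
    steady = math.exp(b3) + math.exp(b4) * (t - b_tau)
    return run_in * trn(b_tau - t, b_gamma) + steady * trn(t - b_tau, b_gamma)

# Initial FIP estimates from Section 3.2.3; b3 follows from continuity.
b1, b2, b4, b_tau, b_gamma = -1.348, 0.555, -7.348, 843.41, 0.0049
b3 = b1 + b2 * math.log(b_tau)
print(cpf_model(b_tau, b1, b2, b3, b4, b_tau, b_gamma))
```

At t = βτ both phase components equal e^{β3} and both transition weights equal 1/2, so the model passes exactly through the transition level.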

For remeasured data, the random error term in the GNLM may not be independent and identically distributed. In order to allow for heteroskedasticity, a general variance function is assumed: Var(y_j) = σ² g²(μ_j, θ) (Davidian and Giltinan, 1995; Vonesh and Chinchilli, 1997). Here, μ_j represents the non-linear growth curve function, θ is the vector of parameters in the variance function and σ is the scale parameter which governs the precision. One common flexible choice of variance function is g(μ_j, θ) = μ_j^θ (usually, θ > 0). Such a function includes common models such as the constant coefficient of variation model (θ = 1) and the Poisson model (θ = 0.5). A general model for describing correlation among observations is the exponential model (Diggle, 1988), where the correlation structure of two observations at (t_j1, t_j2) is defined as Γ_(t_j1, t_j2)(α1, α2) = exp(−α1 |t_j1 − t_j2|^{α2}). Such a structure covers many commonly encountered autocorrelation characteristics (for example, α2 = 1 corresponds to the continuous-time analogue of a first-order autoregressive process). When both the heterogeneity of variance and the autocorrelation between observations are taken into consideration, the error covariance structure within the GNLM is modeled as σ² G^{1/2}(β, θ) Γ(α1) G^{1/2}(β, θ), where G^{1/2}(β, θ) is a diagonal matrix with elements g(μ_j, θ) and Γ(α1) is a correlation matrix defining the autocorrelation structure with α2 = 1.
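The error covariance σ² G^{1/2} Γ G^{1/2} is straightforward to assemble. A sketch with the power-of-the-mean variance function and exponential correlation; all numeric inputs below are illustrative, not fitted values.

```python
import numpy as np

def error_cov(mu, times, sigma, theta, alpha1, alpha2=1.0):
    """sigma^2 * G^(1/2) Gamma(alpha1) G^(1/2), with g(mu_j, theta) =
    mu_j**theta and corr(t_j1, t_j2) = exp(-alpha1 * |t_j1 - t_j2|**alpha2)."""
    mu = np.asarray(mu, dtype=float)
    t = np.asarray(times, dtype=float)
    g_half = np.diag(mu ** theta)                                    # G^(1/2)
    gamma = np.exp(-alpha1 * np.abs(t[:, None] - t[None, :]) ** alpha2)
    return sigma ** 2 * g_half @ gamma @ g_half

# Illustrative mean CPC levels and observation times.
S = error_cov(mu=[10.0, 40.0, 90.0], times=[0.0, 5.0, 20.0],
              sigma=1.0, theta=0.5, alpha1=0.2)
print(np.diag(S))   # variances sigma^2 * mu_j^(2*theta) = mu_j here
```

With θ = 0.5 and σ = 1 the diagonal reduces to the mean itself, the Poisson-like case mentioned above; off-diagonal entries decay with the time gap.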

3.2.1. Parameter estimation

The generalized non-linear least squares ("gnls") function in S-PLUS (Pinheiro and Bates, 2002) can be used to evaluate the parameters of the GNLM. The inference procedure is a combination of least squares and restricted maximum likelihood estimation, which requires good starting values as inputs. An algorithm called FIP (Find Initial Parameter) is proposed to obtain a set of good initial values for fitting the GNLM. An overview of this algorithm is given here (S-PLUS code for this algorithm is available from the authors).

1 This transition function adheres to the three conditions defined in Bacon and Watts (1971). The emphasis here is on the analysis framework and, thus, the particular form is not of great importance. Goodness-of-fit tests based on the likelihood ratio of the proposed models (see Table 4), the fitted plot (see Fig. 7) and residual plots (see Fig. 8) can be used to assess the suitability of the model.

Algorithm FIP.

Step 1. Obtain an initial estimate for the change-point, βτ, by examining the graph shown in Fig. 6. Define this parameter estimate as β̂τ^0.

Step 2. Derive initial values for β1, β2 and β4 using linearized models (see footnote 2) estimated from data before and after the initial change-point estimate given in Step 1. Define these parameter estimates as β̂1^0, β̂2^0 and β̂4^0, respectively.

Step 3. Derive a non-linear least squares estimate, with assumptions of variance homogeneity and no autocorrelation, for βτ, β1, β2 and β4 using the non-linear deterministic component (Equation (1)), the initial change-point estimate β̂τ^0, and the linear model estimates β̂1^0, β̂2^0 and β̂4^0. Also, evaluate the sum of squared residuals (SSresiduals).

Step 4. Iterate Step 3 over an initial estimated range where βγ may exist and evaluate the sum of squared residuals. Select the estimates where SSresiduals is minimized. Define these as β̂τ^1, β̂γ^1, β̂1^1, β̂2^1 and β̂4^1.
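Steps 1 and 2 of FIP reduce to two ordinary least squares fits on transformed data. A minimal sketch follows; it ignores the transition-function correction of footnote 2, so it applies exactly only to data generated from the piecewise form of Equation (1), which is what the synthetic check below uses.

```python
import numpy as np

def fip_initial_values(t, y, tau0):
    """FIP Steps 1-2: given an eyeballed change-point tau0 (Step 1),
    fit ln y = beta1 + beta2 ln t before tau0, and a straight line in
    (t - tau0) after tau0 whose log-slope gives beta4 (Step 2)."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    pre = t <= tau0
    b2, b1 = np.polyfit(np.log(t[pre]), np.log(y[pre]), 1)   # run-in phase
    slope, _ = np.polyfit(t[~pre] - tau0, y[~pre], 1)        # steady state
    return b1, b2, np.log(slope)

# Noise-free synthetic data generated from Equation (1) itself, so the
# linearized fits should recover the true parameters.
true_b1, true_b2, true_b4, tau = 0.5, 0.6, -1.0, 1000.0
t = np.arange(1.0, 2001.0)
y = np.where(t <= tau,
             np.exp(true_b1) * t ** true_b2,
             np.exp(true_b1) * tau ** true_b2 + np.exp(true_b4) * (t - tau))
b1_0, b2_0, b4_0 = fip_initial_values(t, y, tau)
print(b1_0, b2_0, b4_0)
```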

The FIP algorithm proposed here is essentially a linear least squares method based on a linearized model, coupled with an iterative non-linear least squares regression approach, for identifying good starting values. Such an approach would provide consistent estimates for the GNLM under certain regularity conditions (Vonesh and Carter, 1992). Furthermore, the essence of this algorithm adheres to the guidelines for effectively determining good starting estimates (Bates and Watts, 1988) with respect to: (i) the choice of parameters with meaningful graphical interpretations (Step 1); (ii) taking advantage of partially linear models (Step 2); and (iii) the refinement of some parameters by iterating on them while holding all other parameters fixed at current values (Steps 3 and 4). This method is also similar to some traditional methods in pharmacology for estimating a double-exponential model (Ratkowsky, 1990).

3.2.2. Confidence intervals

An approximate 100(1 − α)% confidence interval at a pre-specified time t0 for the GNLM can be evaluated based on large-sample normal approximations (Seber and Wild,

2 The linearized model to evaluate β̂1^0 and β̂2^0 is ln(f(t)) − ln(trn(β̂τ^0 − t)) = β1 + β2 ln t. This model is fitted with data before the change-point. The linear model for evaluating β̂4^0 is f(t) = e^{β3} + e^{β4}(t − β̂τ^0); β̂4^0 is obtained by taking the natural logarithm of the slope of this function.


Table 2. CPF data (time scale in minutes)

Time (t_j)  m(t_j)   Time (t_j)  m(t_j)   Time (t_j)  m(t_j)   Time (t_j)  m(t_j)   Time (t_j)  m(t_j)
1           0.20     81          111.80   262         679.20   610         1969.60  3076        5349.00
12          2.60     93          130.40   268         732.80   646         2098.80  4196        6188.20
14          5.40     117         153.80   270         786.80   719         2242.60  4691        7126.40
15          8.40     129         179.60   273         841.40   752         2393.00  4756        8077.60
22          12.80    162         212.00   295         900.40   758         2544.60  4996        9076.80
30          18.80    179         247.80   319         964.20   774         2699.40  5261        10 129.00
32          25.20    184         284.60   450         1054.20  797         2858.80  5951        11 319.20
33          31.80    193         323.20   451         1144.40  797         3018.20  6211        12 561.40
35          38.80    195         362.20   457         1235.80  806         3179.40  6226        13 806.60
37          46.20    199         402.00   474         1330.60  814         3342.20  6243        15 055.20
38          53.80    211         444.20   480         1426.60  845         3511.20  6296        16 314.40
44          62.60    216         487.40   507         1528.00  891         3689.40  6386        17 591.60
49          72.40    223         532.00   527         1633.40  1211        3931.60  6736        18 938.80
56          83.60    232         578.40   535         1740.40  1270        4185.60  6746        20 288.00
60          95.60    242         626.80   536         1847.60  2741        4733.80

1989; Vonesh and Chinchilli, 1997) given as

[ f̄ (t0, f (t0)] =⌊

f (t0) − zα/2

√σ̂ 2[d′

0�̂d0], f (t0)

+ zα/2

√σ̂ 2[d′

0�̂d0]⌋, (2)

where �−1 = D′S−1(β, θ, α1)D. Here, D is the matrix ofpartial derivatives of f (t) taken with respect to parame-ters and evaluated at the final estimates and S(β, θ, α1) =G1/2(β, θ)�(α1)G1/2(β, θ). d′

0 is the vector of partial deriva-tives of f (t) evaluated at the final parameter estimates fora prediction at time t0, zα/2 is the 100(1 − α/2) percentagepoint of the standard z distribution. σ̂ 2 is the estimatedscale parameter and σ̂ 2� represents the large sample co-variance matrix of the estimated parameters.
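Equation (2) is a standard large-sample (delta-method) interval: the prediction variance is the quadratic form of the gradient d0 with the parameter covariance. A stdlib-plus-NumPy sketch follows; the gradient, covariance and scale values used below are placeholders, not estimates from the paper's data.

```python
import numpy as np
from math import erf, sqrt

def z_quantile(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (stdlib-only)."""
    cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def delta_method_ci(f_t0, d0, cov, sigma2, alpha=0.05):
    """100(1 - alpha)% interval for f(t0):
    f(t0) +/- z_{alpha/2} * sqrt(sigma2 * d0' Cov d0)."""
    z = z_quantile(1.0 - alpha / 2.0)
    half = z * np.sqrt(sigma2 * (d0 @ cov @ d0))
    return f_t0 - half, f_t0 + half
```

In practice d0 would be the gradient of the fitted GNLM at t0 and cov the estimated parameter covariance returned by the fitting routine.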

3.2.3. Sample computations

The complete CPF data are shown in Table 2. For this dataset, the initial parameter estimates obtained from the FIP algorithm are β̂¹_γ = 0.0049, β̂¹_τ = 843.41, β̂¹_1 = −1.348, β̂¹_2 = 0.555 and β̂¹_4 = −7.348. With these initial estimates, the generalized non-linear least squares function in S-PLUS (Pinheiro and Bates, 2002) can then be used to evaluate the parameters of the GNLM. Table 3 shows some sample calculations of the predicted CPF and upper 95% confidence limits for a particular CPC dataset. The log-likelihood ratios and p-values for the comparison of models with and without the variance function and autocorrelation structure are shown in Table 4. From the test results, the addition of the power law variance function and the exponential autocorrelation structure both appear to produce an improved estimate of the parametric CPF function. The parameter estimates and 95% confidence intervals for a GNLM with a power law variance function and an exponential autocorrelation structure are shown in Table 5. A plot of the model estimates together with approximate upper 95% confidence intervals is shown in Fig. 7. The normal probability plot of residuals, shown in Fig. 8, does not show any significant deviations from normality. In addition, the strong correlation observed in the "observed versus fitted" plots together with small standardized residuals shows the explanatory power of the proposed model.

Table 3. Predicted CPF and upper 95% confidence intervals

Ordered time (t_j) | Drive ID | g_i(t_j) | n_i(t_j) | r_i(t_j) | m(t_i) | f(t_j) | f̄(t_j)
   1 | N5 |  1 | 1 |  1 |  0.20 |  0.75 |  1.01
  12 | N2 |  1 | 1 |  1 |  0.40 |  1.35 |  1.56
  14 | N3 |  1 | 1 |  1 |  0.60 |  1.43 |  1.64
  15 | N5 |  2 | 1 |  2 |  0.80 |  1.47 |  1.68
  22 | N3 |  2 | 1 |  2 |  1.00 |  1.71 |  1.92
   : |  : |  : | : |  : |   :   |   :   |   :
6736 | N4 | 11 | 1 | 11 | 15.20 | 14.74 | 15.26
6746 | N5 | 12 | 1 | 12 | 15.40 | 14.74 | 15.26

Table 4. Likelihood ratio tests for models with and without power law variance function and exponential autocorrelation structure

No. | Variance model | Autocorrelation model | Degrees of freedom | Log-likelihood | Test | Likelihood ratio | p-Value
1 | No | No | 6 | −34.635 | | |
2 | No | Γ(t_{j1}, t_{j2})(α) = exp(−α1 |t_{j1} − t_{j2}|^{α2}) | 7 | 5.526 | 1 vs 2 | 80.323 | <0.0001
3 | g(µ_j, θ) = µ_j^θ | Γ(t_{j1}, t_{j2})(α) = exp(−α1 |t_{j1} − t_{j2}|^{α2}) | 8 | 7.536 | 2 vs 3 | 4.019 | 0.0451

Fig. 7. CPC growth model with approximate 95% normal confidence intervals (note: rescaled time scale in minutes).
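The likelihood ratio statistics in Table 4 are twice the difference in log-likelihoods, referred to a chi-square distribution on one degree of freedom (one parameter is added per comparison, for which the upper-tail probability reduces to erfc(√(stat/2))). A quick stdlib check of the tabled values:

```python
from math import erfc, sqrt

def lr_test_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test adding one parameter: the statistic is
    2*(ll_full - ll_reduced); with 1 degree of freedom the chi-square
    upper-tail probability equals erfc(sqrt(stat / 2))."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, erfc(sqrt(stat / 2.0))

# Log-likelihoods from Table 4:
stat_12, p_12 = lr_test_df1(-34.635, 5.526)  # model 1 vs model 2
stat_23, p_23 = lr_test_df1(5.526, 7.536)    # model 2 vs model 3
```

These reproduce the tabled ratios (80.323 and 4.019) and p-values (<0.0001 and ≈0.045) up to the rounding of the log-likelihoods.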

3.3. Non-parametric regression model

Apart from the parametric regression model discussed in the preceding section, and given our objective of comparing drive designs and assessing reliability improvements, non-parametric regression techniques can be used to estimate the CPC growth curve. The general form underlying the approach is given as y_j = κ(t_j) + e_j, where E(e_j) = 0 and Var(e_j) = σ²_j < ∞. Since the CPF is a non-decreasing function over time, a form of scatter plot smoothing is used that first produces a non-parametric locally weighted scatter plot smooth (LOWESS) and then searches for a monotone approximation of the smooth (Friedman and Tibshirani, 1984). The monotone approximation is obtained by the pool-adjacent-violators algorithm (Barlow et al., 1972). The resulting non-parametric smooth (Fig. 9) is defined as the monotone LOWESS (m-LOWESS) smooth. The upper 95% bootstrap percentile intervals, based on a non-parametric bootstrap on pairs of data, were evaluated from 1000 samples. The m-LOWESS smooth is fully specified by the proportion of data points used for each local fit (α) and the degree of the local polynomial. Given the monotone nature of the CPF, local linear models are sufficient. The non-parametric regression model essentially serves as an alternative means of predicting CPC growth. A monotone local linear model produced lifetime predictions comparable to those of the parametric GNLM, which describes an asymptotically constant rate of CPC growth. Other choices of α may produce fits closer to the parametric GNLM.
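A minimal sketch of the m-LOWESS idea, assuming a basic tricube-weighted local linear smoother followed by the pool-adjacent-violators algorithm; this is an illustrative reimplementation, not the authors' code, and omits robustness iterations and the bootstrap intervals:

```python
import numpy as np

def lowess_linear(x, y, frac=0.5):
    """Local linear smoother with tricube weights over the nearest
    ceil(frac * n) points at each fitted location."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                 # k nearest neighbours
        h = d[idx].max()
        w = (1.0 - (d[idx] / h) ** 3) ** 3      # tricube kernel weights
        b, a = np.polyfit(x[idx], y[idx], 1, w=np.sqrt(w))
        fitted[i] = a + b * x[i]
    return fitted

def pava(y):
    """Pool-adjacent-violators: monotone non-decreasing fit."""
    vals, wts = [], []
    for v in map(float, y):
        vals.append(v); wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            pooled = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            vals[-2:], wts[-2:] = [pooled], [w]
    return np.array([v for v, w in zip(vals, wts) for _ in range(w)])

def m_lowess(x, y, frac=0.5):
    """Monotone LOWESS: smooth first, then monotonize."""
    return pava(lowess_linear(x, y, frac))
```

Applied to the (t_j, m(t_j)) pairs of Table 2, m_lowess yields a non-decreasing CPC growth estimate that can be compared directly with the parametric fit.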

4. Prediction of lifetime distribution

For a specific functional form of the deterministic component f(t) in the GNLM, the TTF distribution can be evaluated using sample estimates from

F_CPC(t) = Φ((ln(f(t)) − µ)/σ),

where µ and σ represent the location and scale parameters of the lognormal CTF distribution, respectively, and Φ is the standard normal cumulative distribution function. Since we are interested in obtaining a conservative estimate of the lifetime distribution for reliability assessments at the design stage, we need to take estimation uncertainty (e.g., sampling and measurement errors) into account. Such a conservative estimate can be obtained from the lower bounds of the TTF distribution. Given that design engineers for micro-HDDs are primarily interested in the lifetime distribution over the range of 100–10 000 parts failed per million in 5 years, a conservative 90% lower confidence bound based on Bonferroni intervals is

F_CPC(t) = Φ((ln(f̄(t)) − µ)/σ̄),

where f̄(t) is the 95% normal approximate upper confidence bound for the CPC growth curve and, for the CTF distribution, µ is the lower 97.5% confidence limit for µ̂ and σ̄ is the upper 97.5% confidence limit for σ̂ of the lognormal parameters (Bickel and Doksum, 2000). The predicted lifetime distribution and lower confidence intervals based on a parametric GNLM are shown in Fig. 10. In a practical setting, the conservative lower bound for the 100-ppm quantile can be evaluated by setting F_CPC(t) = 100 ppm.

Fig. 8. (a) Normal probability plot of standardized residuals; and (b) observed vs fitted values.

Fig. 9. CPC parametric growth model with 95% confidence intervals and m-LOWESS smooth (α = 0.5) with 95% bootstrap percentile intervals (note: rescaled time scale in minutes).

Fig. 10. TTF distribution and conservative 90% lower confidence bound (note: rescaled time scale in minutes).

Fig. 11. Reliability comparisons of micro-HDD designs with and without comb addition (note: rescaled time scale in minutes).

Fig. 12. Reliability growth tracking for drives of different vintages (note: rescaled time scale in minutes).
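The 100-ppm design life can be read off by solving the conservative bound for t numerically. A stdlib sketch under an assumed, purely hypothetical power-law upper-bound curve f̄(t) and illustrative CTF limits — none of these numbers come from the paper's tables:

```python
from math import erf, log, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ttf_bound(t, f_bar, mu_lo, sigma_hi):
    """Conservative TTF CDF bound: Phi((ln f_bar(t) - mu_lo) / sigma_hi)."""
    return phi((log(f_bar(t)) - mu_lo) / sigma_hi)

def quantile_time(p, f_bar, mu_lo, sigma_hi, t_lo=1e-3, t_hi=1e9):
    """Bisect (on a log time scale) for the time at which the conservative
    bound reaches probability p; valid when f_bar is increasing in t."""
    for _ in range(200):
        mid = sqrt(t_lo * t_hi)
        if ttf_bound(mid, f_bar, mu_lo, sigma_hi) < p:
            t_lo = mid
        else:
            t_hi = mid
    return sqrt(t_lo * t_hi)

# Hypothetical upper-bound CPC growth curve and lognormal CTF limits:
f_bar = lambda t: 0.9 * t ** 0.7
t_100ppm = quantile_time(100e-6, f_bar, mu_lo=2.0, sigma_hi=0.6)
```

Because the bound uses the upper CPC curve with the widened CTF limits, the returned time is a pessimistic (early) estimate of the 100-ppm life, as intended.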

5. Applications

Two applications of the proposed framework are design selection and reliability growth assessment across different vintages of HDDs. In an application of the framework for design selection, the design change to be made is the addition of a comb-like device that is expected to dampen flow-induced vibration within the G2 series of micro-HDDs. Under higher rates of particle injection, the CTF distributions for the drives before and after this design modification are found to follow the lognormal distribution. The CPC growth functions for the two designs are estimated using prime drives. The TTF distribution is then derived from the CTF distribution using the parametric CPC growth curve estimate, as shown in Fig. 11. From this comparison, it appears that the addition of combs manages to reduce variability in the failure time distribution.

Reliability improvements across different vintages of hard disks can also be tracked using the proposed reliability assessment framework. In a typical application, the CPC growth curve is first established. The TTF distributions of the different design vintages can then be established and compared through the CTF distributions estimated with accelerated particle injections. The TTF distributions and lower confidence intervals for the G1 and G2 series of drives are shown in Fig. 12.

Table 5. Parameter estimates and corresponding confidence intervals

Parameter | Lower 95% confidence interval | Estimated value | Upper 95% confidence interval
β1 | −3.867 | −2.378 | −0.889
β2 | 0.478 | 0.713 | 0.948
β4 | −7.483 | −7.257 | −7.031
βτ | 590.098 | 765.981 | 941.864
βγ | 0.0014 | 0.0027 | 0.0039

6. Conclusions

In this article, a generic failure susceptibility measure, known as the CPC, is proposed as an alternative to time for the reliability assessment of micro-HDDs at the design stage. A reliability modeling framework utilizing the CTF distribution and a CPC growth curve to establish the TTF distribution of HDDs within the design stage is proposed. A new non-linear change-point model and corresponding estimation procedures are proposed within this framework for modeling the CPC growth. The use of a CPC-based failure susceptibility measure enables the effective acceleration of HDD failure, thus enabling design flaws to be revealed and reliability to be managed at the design stage without the design cycle time being compromised.

From a more general perspective, the key to successful implementation of the proposed approach is the ability to identify the primary failure mechanism, which typically occurs at interfaces between different media or materials, so as to allow a plausible failure susceptibility scale to be established. This scale has to be consistent with the observed failure phenomenon and related to the increase in opportunities for failure, thus allowing for "acceleration" in real time.


Acknowledgement

The authors would like to thank the three anonymous referees and Professor David Coit for their valuable suggestions and inputs on this work.

References

Aitchison, J. and Brown, J.A.C. (1957) The Lognormal Distribution, Cambridge University Press, New York, NY.

Bacon, D.W. and Watts, D.G. (1971) Estimating the transition between two intersecting straight lines. Biometrika, 58, 525–534.

Barlow, R.E., Bartholomew, D.J., Bremner, J.M. and Brunk, H.D. (1972) Statistical Inference Under Order Restrictions, Wiley, London, UK.

Bates, D.M. and Watts, D.G. (1988) Nonlinear Regression Analysis and Its Applications, Wiley, New York, NY.

Bergin, M. and Koka, R. (1993) Measurement of particulate contamination levels in disks with aerosol counters, in Advances in Information Storage Systems, Volume 5, Bhushan, B. (ed), ASME Press, New York, NY, pp. 387–395.

Bhushan, B. (1996) Tribology and Mechanics of Magnetic Storage Devices, Springer, New York, NY.

Bickel, P.J. and Doksum, K.A. (2000) Mathematical Statistics, Basic Ideas and Selected Topics, Prentice Hall, Upper Saddle River, NJ.

Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data, Chapman and Hall, Boca Raton, FL.

Diggle, P.J. (1988) An approach to the analysis of repeated measurements. Biometrics, 44, 959–971.

Duchesne, T. and Lawless, J. (2000) Alternative time scales and failure time models. Lifetime Data Analysis, 6, 157–179.

Eliashberg, J., Singpurwalla, N.D. and Wilson, S.P. (1997) Calculating the reserve for a time and usage indexed warranty. Management Science, 43(7), 966–975.

Friedman, J. and Tibshirani, R. (1984) The monotone smoothing of scatterplots. Technometrics, 26(3), 243–250.

Gertsbakh, I. and Kordonsky, K.H. (1969) Models of Failures, Springer, Berlin, Germany.

Hiller, B. (1995) Handbook of Micro/Nanotribology, CRC Press, Boca Raton, FL.

Hiller, B. and Brown, B. (1994) Interaction of individual alumina particles with the head/disk interface at different speeds, in Advances in Information Storage Systems, Volume 5, Bhushan, B. (ed), ASME Press, New York, NY, pp. 351–361.

Hiller, B. and Singh, G. (1991) Interaction of contaminant particles with the particulate slider disk interface, in Advances in Information Storage Systems, Volume 2, Bhushan, B. (ed), ASME Press, New York, NY, pp. 173–180.

Koka, R. and Kumaran, A.R. (1991) Visualization and analysis of particulate buildup on the leading edge tapers of sliders, in Advances in Information Storage Systems, Volume 2, Bhushan, B. (ed), ASME Press, New York, NY, pp. 161–171.

Lawless, J. (1995) Methods of estimation of failure distribution and rates from automobile warranty data. Lifetime Data Analysis, 1, 227–240.

Li, Y.F. and Sharma, V. (2000) The spin off of particles on a magnetic disk. Transactions of the ASME, Journal of Tribology, 122, 293–299.

Lindstrom, M.J. and Bates, D.M. (1990) Nonlinear mixed effects models for repeated measures data. Biometrics, 46, 673–687.

Liu, B., Soh, S.H., Chekanov, A., Hu, S.B. and Low, T.S. (1996) Particle build-up on flying sliders and mechanism study of disk wear and head-disk interface failure in magnetic disk drives. IEEE Transactions on Magnetics, 32(5), 3687–3689.

Makis, V., Jiang, S. and Jardine, A.K.S. (1998) A condition based maintenance model. IMA Journal of Management Mathematics, 9, 201–210.

Makis, V., Wu, J. and Gao, Y. (2006) An application of DPCA to oil data for CBM modeling. European Journal of Operational Research, 174, 112–123.

Meeker, W.Q. and Escobar, L.A. (1998) Statistical Methods for Reliability Data, Wiley, New York, NY.

Meeker, W.Q. and Escobar, L.A. (2006) SPLIDA (S-PLUS Life Data Analysis). Available at http://www.public.iastate.edu/∼splida/

Nelson, W. (2003) Recurrent Events Data Analysis for Product Repairs, Disease Recurrences, and Other Applications, Society for Industrial & Applied Mathematics, Philadelphia, PA.

Park, H.J., Yoo, Y.C., Bae, G.N. and Huang, J.H. (1999) Investigation of particle generation in a hard disk drive during the start/stop period. IEEE Transactions on Magnetics, 35(5), 2439–2441.

Pinheiro, J.C. and Bates, D.M. (2002) Mixed Effects Models in S and S-PLUS, Springer, New York, NY.

Ratkowsky, D.A. (1990) Handbook of Nonlinear Regression Models, Marcel Dekker, New York, NY.

Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression, Wiley, Hoboken, NJ.

Shah, S. and Elerath, J.G. (2004) Drive vintage and its effect on reliability, in Proceedings of the Annual Reliability and Maintainability Symposium, IEEE, Piscataway, NJ, pp. 163–167.

Singpurwalla, N.D. and Wilson, S.P. (1998) Failure models indexed by two time scales. Advances in Applied Probability, 30, 1058–1072.

Strom, B.D., Lee, S., Tyndall, G.W. and Khurshudov, A. (2007) Hard disk drive reliability modeling and failure prediction. IEEE Transactions on Magnetics, 43(9), 3676–3684.

Tang, L.C., Lam, S.W., Ng, Q.Y. and Goh, J.S. (2007) Efficient reliability predictions of particle-induced hard disk failures in product development, in Proceedings of the Annual Reliability and Maintainability Symposium, IEEE, Piscataway, NJ, pp. 259–264.

Tang, L.C., Ng, Q.Y., Cheong, W.T. and Goh, J.S. (2006) Reliability assessment for particle-induced failures in multi-generation hard disk drives. Microsystems Technology, 13(8), 891–894.

Tobias, P.A. and Trindade, D.C. (1998) Applied Reliability, Chapman and Hall, Boca Raton, FL.

Vonesh, E.F. and Carter, R.L. (1992) Mixed effects nonlinear regression for unbalanced repeated measures. Biometrics, 48, 1–17.

Vonesh, E.F. and Chinchilli, V.M. (1997) Linear and Nonlinear Models for the Analysis of Repeated Measurements, Marcel Dekker, New York, NY.

Yoon, E.S. and Bhushan, B. (2001) Effect of particulate concentration, materials and size on the friction and wear of a negative-pressure picoslider flying on a laser-textured disk. Wear, 247, 180–190.

Biographies

Loon Ching Tang is an Associate Professor and Head of the Department of Industrial and Systems Engineering at the National University of Singapore. He obtained his Ph.D. degree from Cornell University in the field of Operations Research in 1992. He has published widely in various international peer-reviewed journals, including IEEE Transactions on Reliability, Journal of Quality Technology, Naval Research Logistics and Queueing Systems. Besides being on the editorial review board of the Journal of Quality Technology, he has been an active reviewer for a number of international journals and has been consulted on problems demanding innovative applications of probability, statistics and other operations research techniques.

Shao Wei Lam is a Research Fellow at the National University of Singapore. He has a B.Eng. in Mechanical Engineering and an M.Eng. in Industrial and Systems Engineering from NUS. Apart from industrial research experience as a research engineer with the Design Technology Institute, he has worked as a senior policy analyst


with the Agency for Science and Technology, which oversees science and technological developments in Singapore. He is also a Certified Reliability Engineer, certified by the American Society for Quality.

Quock Y. Ng has worked as a Principal Engineer with Seagate Technology International, R&D Engineering, R&D and Design Center in Science Park, Singapore. He is engaged in R&D activities involving multiple disciplines important to hard disk drives. He obtained his B.S. and Ph.D. in Chemistry from City College of New York and the University of Alabama, Tuscaloosa.

Jing Shi Goh is a research engineer at Seagate Technology International and is currently working in the area of aerosol contamination.
