



Applied Energy 99 (2012) 423–429


Prognostics-based risk mitigation for telecom equipment under free air cooling conditions

Jun Dai, Diganta Das, Michael Pecht*
Center for Advanced Life Cycle Engineering (CALCE), Mechanical Engineering, University of Maryland, College Park, MD 20742, USA

Highlights

" Analyze the potential risks arising from free air cooling (FAC) in data centers." Implement a prognostics-based approach to mitigate risks of FAC." Present a case study to show the prognostics-based method implementation." Enable the implementation of FAC in data centers not originally designed for FAC.

Article info

Article history: Received 27 January 2012; Received in revised form 16 May 2012; Accepted 31 May 2012; Available online 2 July 2012.

Keywords: Data center; Free air cooling; Reliability; Risk mitigation; Prognostics and health management (PHM)

0306-2619/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.apenergy.2012.05.055

* Corresponding author. Tel.: +1 301 405 5323; fax: +1 301 314 9269. E-mail address: [email protected] (M. Pecht).

1. This value is the midpoint of the low bound and high bound of the energy consumption estimation.

Abstract

The telecommunications industry is becoming increasingly conscious of the energy consumption and environmental footprint of its data centers. One energy-efficient approach, free air cooling, uses ambient air instead of air conditioning to cool data-center equipment. Free air cooling is being adopted in existing data centers with equipment that has not been designed or qualified for a free air cooling regime. Traditionally, product qualification is based on passing tests defined by industry standards, which assume certain expected environmental conditions. However, environmental conditions under free air cooling may go beyond those expected conditions. This paper identifies the performance and reliability risks associated with the implementation of free air cooling and develops a prognostics-based approach to assess and mitigate the risks to telecom equipment under free air cooling conditions. A case study is presented to demonstrate the implementation of this approach.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Energy consumption is a major operating expense in data centers, which include all the buildings, facilities, and rooms that contain data servers, telecommunication equipment, and cooling and power equipment. The worldwide energy consumption of data centers increased by about 56% between 2005 and 2010, and reached about 237 terawatt hours (TW h; see footnote 1) in 2010, accounting for about 1.3% of the world's electricity usage [1]. In the US, data center energy consumption increased by about 36% between 2005 and 2010, reaching 76 TW h and accounting for about 2% of total US electricity consumption in 2010 [1]. Cooling systems (primarily air conditioners) in data centers account for a large part of this energy consumption: in 2009, about 40% of the energy consumed by data centers was for cooling [2,3], as shown in Fig. 1.


Free air cooling, which uses ambient air to cool data center equipment, is one approach for saving energy on cooling. This cooling method is being increasingly used in industry. Intel demonstrated free air cooling at a 10-megawatt (MW) data center, showing a reduction in energy use and saving US$2.87 million [4]. Microsoft and Google used free air cooling to replace traditional chillers in some of their new data centers in Europe in 2009 [5,6]. Due to the potential energy savings from free air cooling, regulations and standards are being updated to facilitate its adoption. For example, the "EU Code of Conduct on Data Centers" [7] recommends free air cooling as the preferred cooling method in data centers. In the 2010 version of the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) Standard 90.1 [8], free cooling (see footnote 2) is recommended in U.S. data centers if they are not located in climate zones 1a and 1b (see footnote 3) and have a cooling capacity reaching 54,000 BTU/h (see footnote 4).

2. ASHRAE 90.1 uses the word "economizer", which is another name for "free cooling" in industry. There are generally two kinds of economizers: airside economizers and waterside economizers. Waterside economizers are not included in this paper because they re-circulate water instead of air.

Fig. 1. Power consumption distribution in data centers [3]: IT equipment 44%; cooling 40%; utility transmission and distribution loss 7%; power 6%; lighting and other 3%.


1.1. Implementation of free air cooling with airside economizers

Free air cooling uses outside ambient air to cool equipment in data centers directly. Free air cooling typically involves the use of airside economizers when the conditions of the ambient air are within the required operating condition ranges. Fig. 2 shows an example of an airside economizer [35]. Generally, an airside economizer comprises sensors, ducts, dampers, and containers that admit the appropriate volume of air in the right temperature range (set by operators) into the installation with smart conditioning of the incoming and exhaust air streams. This set temperature range can be based on the recommended operating ranges given by published standards, such as ASHRAE [10], Telcordia GR-63-CORE [11], and Telcordia GR-3028-CORE [12]. If the outside ambient air conditions are within this range, or if they can be brought within the range by mixing cold outside air with warm return air, then outside air can be used to cool the data center via an airside economizer fan. If the conditions achievable by economization and mixing of outside air are outside the pre-set ranges, a cooling and heating control system is used to adjust the supply air conditions to within the pre-set ranges. In some cases, the airside economizer can be isolated in favor of a back-up air-conditioning system. Generally, the energy saving potential of free air cooling varies with the supply air temperature and the setting of the room temperature control.
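To make this control flow concrete, the following is a minimal sketch (in Python) of the economizer decision logic just described. The function name, the mixing rule, and the use of the 18–27 °C ASHRAE 2008 recommended range as the set range are illustrative assumptions, not details from the paper or from any particular product.

SUPPLY_MIN_C, SUPPLY_MAX_C = 18.0, 27.0  # example set range (ASHRAE 2008 recommended limits)

def choose_cooling_mode(outside_c, return_c):
    # Outside air already inside the set range: supply it directly.
    if SUPPLY_MIN_C <= outside_c <= SUPPLY_MAX_C:
        return "free air cooling: supply outside air directly"
    # Cold outside air can be blended with warm return air to reach the set range.
    if outside_c < SUPPLY_MIN_C and return_c > SUPPLY_MIN_C:
        return "free air cooling: mix outside air with return air"
    # Otherwise fall back to the cooling/heating control or back-up air conditioning.
    return "back-up air conditioning / heating control"

print(choose_cooling_mode(22.0, 35.0))  # direct free air cooling
print(choose_cooling_mode(5.0, 35.0))   # mixing mode
print(choose_cooling_mode(38.0, 35.0))  # mechanical cooling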

1.2. Related research

The current literature primarily addresses how to maximize the energy savings from free air cooling. Bulut and Aktacir [13–15] studied free air cooling with a case study in Istanbul, Turkey. They found that the potential savings from free air cooling varied across the months of the year; the season transition months (April, May, September, and October) usually had the highest energy saving potential. However, they found that the benefits of free air cooling from June to August were not significant due to high outdoor air temperatures. They also analyzed the effects of the set supply air temperature on the potential savings, and the results showed that the savings could increase by about 34% if the set supply air temperature increased from 15 °C to 24 °C. Sorrentino et al. [16] simulated the optimization of the energy management of telecom switching plants and claimed that proper setting of the supply air temperature range could save up to 25% of the energy used for cooling.

3. In the 2010 version of ASHRAE Standard 90.1, the United States is divided into climate zones: 1a, 1b, 2a, 2b, 3a, 3b, 3c, 4a, 4b, 4c, 5a, 5b, 5c, 6a, 6b, 7, and 8. Climate zones 1a and 1b are the very hot areas, such as Miami.

4. BTU stands for British Thermal Unit, a unit of thermal energy equal to 1055.05585 J.

Dovrtel and Medved [17] incorporated weather forecasts into the free air cooling control system and then optimized the regime of operating conditions to achieve improved efficiency. These studies did not address any possible risks from free air cooling.

Some companies have performed research on the reliability of equipment and have obtained preliminary results. For example, Intel reported only a small difference between the 4.46% failure rate of equipment under free air cooling and the 3.83% failure rate of equipment under traditional air-conditioned cooling over a 10-month period [4]. However, no economic analysis of the impact of this difference in failure rate was offered to show that the impact of free air cooling is low. Moreover, the project duration of 10 months was short compared with the lifetime of data centers and equipment, which is typically on the order of 5–10 years. As a result, it is not known whether the differential in the failure rates will increase with time or stay the same.

Dell ran a data center with free air cooling at 40 °C and 85%RH for more than 12,000 h (see footnote 5) [18], and their results showed only a small difference in the number of hard failures compared with the control cell operated within the ASHRAE recommended range. But as in the Intel case, the Dell case also focused mainly on server hardware; communication equipment, such as routers and switches, was not included in the tests. Clearly, more research is needed to identify the risks of free air cooling to telecom equipment in data centers.

In this paper, we analyze the potential risks arising from free air cooling based on the expected operating conditions. Then, we provide a risk mitigation approach based on prognostics and health management (PHM), which is an enabling discipline consisting of technologies and methods to assess the reliability of a product in its actual life cycle conditions to determine the advent of failure and mitigate system risks [19]. A case study is then presented to demonstrate the implementation of this approach on an example piece of hardware.

2. Potential risks associated with free air cooling

The design rules, test conditions, acceptance conditions, and overall operating cost estimates for data center equipment are impacted by their operating conditions. Free air cooling changes the operating conditions for the equipment in data centers compared with the operating conditions under traditional air conditioning. This section analyzes the potential reliability risks associated with the implementation of free air cooling.

ASHRAE published "Thermal Guidelines for Data Centers and Other Data Processing Environments" for data center operating conditions in 2004 [9]. These guidelines provide recommended limits of 20–25 °C and 40–55%RH and allowable limits of 15–32 °C and 20–80%RH for data centers. The 2008 revised specifications expanded the recommended temperature limits to 18–27 °C, which allows more operating hours for airside economizers in free air cooling mode [10]. Telcordia Generic Requirements GR-63-CORE [11] and GR-3028-CORE [12] also provide operating condition specifications for telecom equipment. The recommended limits are 18–27 °C and 5–55%RH, and the allowable limits are 5–40 °C (see footnote 6) and 5–85%RH, which are slightly different from those of ASHRAE.
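A simple way to use these published limits operationally is to test whether a measured supply condition falls inside a given envelope. The sketch below is illustrative only: the dictionary names and the pairing of the 2008 recommended temperature band with the 2004 recommended humidity band are assumptions based on the figures quoted above, not a published combined specification.

ASHRAE_RECOMMENDED = {"temp_c": (18.0, 27.0), "rh_pct": (40.0, 55.0)}  # assumed combined envelope
ASHRAE_ALLOWABLE_2004 = {"temp_c": (15.0, 32.0), "rh_pct": (20.0, 80.0)}

def within_limits(temp_c, rh_pct, limits):
    # Returns True when both temperature and relative humidity are inside the envelope.
    t_lo, t_hi = limits["temp_c"]
    rh_lo, rh_hi = limits["rh_pct"]
    return t_lo <= temp_c <= t_hi and rh_lo <= rh_pct <= rh_hi

print(within_limits(25.0, 50.0, ASHRAE_RECOMMENDED))     # True
print(within_limits(30.0, 75.0, ASHRAE_RECOMMENDED))     # False
print(within_limits(30.0, 75.0, ASHRAE_ALLOWABLE_2004))  # True: allowable but not recommended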

Most data centers operate with inlet temperatures between 24 °C and 27 °C [20], and humidity levels in data centers are typically maintained between 35%RH and 55%RH [21].

5. Dell claimed that this was equivalent to more than 7 years of the worst case of free air cooling in an EU climate.

6. The temperature in the ASHRAE standard is the inlet temperature of the telecom equipment, whereas the temperature in Telcordia is the ambient temperature of the telecom equipment.

Fig. 2. Schematic of airside economizer with airflow path [9].


With the implementation of free air cooling, the operating conditions may change significantly. For example, the inlet temperature range was 18–32 °C in the Intel case, and the relative humidity range was 4%RH to more than 90%RH [4]. Depending on the energy saving goals and the local climate, the data center equipment may experience greater operating condition variations under free air cooling. The potential risks to equipment reliability from these operating conditions are discussed in the following sections.

In a traditional data center, there are multiple air conditioning units that can be used to "fine tune" the data center temperature in addition to the use of air flow. In contrast, the temperature across a free air cooled data center is controlled largely by air flow. If a data center is not optimized for free air cooling, variations in air flow patterns could create new hot spots or exacerbate existing ones.

Naturally occurring temperature and humidity changes can cause cycling-induced damage that contributes to the cumulative damage of components and assemblies, as well as humidity-induced corrosion-related failure mechanisms. Typically, relative humidity levels in data centers are maintained between 35%RH and 55%RH. This control provides effective protection against a number of failure mechanisms, such as electrochemical migration and conductive anodic filament (CAF) formation. The introduction of outside air could result in significant variation in ambient humidity. For example, in the Intel trial, the relative humidity varied from 4%RH to over 90%RH [4]. Both high and low humidity regimes can activate failure mechanisms: CAF formation is accelerated by high humidity, while electrostatic discharge (ESD) is more common in low humidity.

The operating temperature under free air cooling is usually higher than under traditional air conditioning; this may pose risks to components that are sensitive to temperature changes. For example, the uninterruptible power supply (UPS) batteries used in data centers have temperature-dependent life expectancies; an increase in temperature may accelerate the corrosion of bipolar plates in the batteries, with more water being consumed. Generally, the lifetime of a valve-regulated lead-acid (VRLA) battery is maximized at around 25 °C, and it is estimated that battery lifetimes may drop by 50% when the operating temperature increases by 8 °C [22].

The reliability of batteries may therefore be significantly affected by the implementation of free air cooling.
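As a worked illustration of the sensitivity quoted above, the rule of thumb that VRLA battery life is halved for every 8 °C above roughly 25 °C can be written as L(T) = L25 x 0.5^((T - 25)/8). The sketch below applies this assumed relation; the 5-year base life is a hypothetical example, not a figure from the paper or from [22].

def vrla_life_years(temp_c, base_life_years=5.0, ref_c=25.0):
    # Assumed derating: life halves for every 8 degC above the reference temperature.
    return base_life_years * 0.5 ** ((temp_c - ref_c) / 8.0)

for t in (25, 33, 41):
    print(f"{t} degC -> {vrla_life_years(t):.2f} years")
# 25 degC -> 5.00 years, 33 degC -> 2.50 years, 41 degC -> 1.25 years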

Free air cooling may result in reduced contamination control [4]. Contaminants of concern in data centers include particulates (smoke and dust) and gases (SO2, NO2, O3, HCHO, H2S, and Cl2). Generally, contamination has three main effects on telecom equipment: (1) chemical effects, for example, copper creep corrosion on circuit boards and silver metallization corrosion in miniature surface-mounted components; (2) mechanical effects, for example, heat sink fouling, optical signal interference, and increased friction; and (3) electrical effects, such as changes in circuit impedance and arcing [23].

Free air cooling may also accelerate wear-out of cooling components, such as the fans used in servers and routers. Some types of fans are designed with multiple speeds, and when the operating temperature increases, the cooling system controls may increase the fan speed and duty cycle to offset the temperature increase. This increased speed and duty cycle can affect the lifetime and reliability of the fans.

Assessment of these risks should be conducted in the equipment and system design stage if possible. However, for data centers that were not originally designed for free air cooling, mitigating risks at the operation stage is necessary.

The main hindrance to risk assessment is the lack of historical data on data center failure trends. The potential energy savings offered by free air cooling are promising, and the industry cannot afford to wait until all the risks have been identified and eliminated before adoption. One technology that can help the industry manage the risks while reaping the benefits of free air cooling is prognostics and health management.

3. Risk assessment and mitigation by prognostics-based health management

For data centers already in operation, re-qualifying equipment is not a viable option for evaluating the risks when free air cooling is considered. It is not practical to take equipment out of service for testing. Even if this were attempted, the tested equipment would be sacrificed, since it would lose useful life in the process. Nor would it be possible to gather an appropriate sample size of systems already in operation.

Accelerating the life-cycle conditions for an entire data center is also impractical, prohibitively expensive, and unlikely to provide useful information on the reliability of the system. Instead, we propose using prognostics and health monitoring as a retrofitting technique that can assess and mitigate the risks for data centers in operation. This technique allows the implementation of free air cooling in data centers that were not initially designed for this cooling method.

Prognostics and health management (PHM) uses in situ system monitoring and data analysis to identify the onset of abnormal behavior that may lead to either intermittent out-of-specification performance or permanent equipment failure. This method does not need to interrupt data center service for the purpose of reliability assessment. PHM permits the assessment of the reliability of a product (or system) during operation [24]. Generally, PHM can be implemented using the physics-of-failure (PoF) approach, the data-driven approach, or a combination of both (the fusion approach) [25].

The physics-of-failure approach uses knowledge of a product's life-cycle loading and failure mechanisms to perform reliability design and assessment [24,26–29]. The data-driven approach uses mathematical analysis of current and historical data to provide signals of abnormal behavior and to estimate remaining useful life (RUL) [24]. The fusion approach combines PoF and data-driven models for prognostics [25], overcoming some of the drawbacks of using either approach alone.

3.1. A prognostics-based approach for risk mitigation in free air cooling

A prognostics-based approach is proposed for assessing and mitigating the risks due to the implementation of free air cooling, as shown in Fig. 3. This approach starts with identifying the set operating condition range under free air cooling. Based on the identified operating condition range, a failure modes, mechanisms, and effects analysis (FMMEA) is conducted to identify the weakest subsystems/components, which are the most likely to fail first in the system.

FMMEA is a methodology used to identify critical failure mechanisms and their associated failure sites. It combines traditional failure modes and effects analysis (FMEA) with knowledge of the physics of failure [30]. A failure mechanism is defined as the process by which a specific combination of physical, electrical, chemical, and mechanical stresses induces failure. The underlying failure mechanisms of a system become evident to the user through failure modes, which are tangible observations of how the system or device has failed.

Fig. 3. A prognostics-based approach to mitigating the risks of free air cooling (steps: identification of operating conditions; FMMEA and identification of the weakest subsystems/parts; monitoring of the system and of the weakest subsystems/parts; anomaly detection using PoF, data-driven, or fusion PHM approaches).

Overheating, unexpected shutdown, and reduced performance are observable failure modes. FMMEA uses a life cycle profile to identify the active stresses and select the potential failure mechanisms. The failure mechanisms are then prioritized based on knowledge of load type, level, and frequency, combined with the failure sites, severity, and likelihood of occurrence. Several mechanisms may occur at a higher rate under free air cooling conditions due to uncontrolled humidity: electrochemical migration (often occurs in low relative humidity), conductive anodic filament (CAF) formation (often occurs in high humidity), and creep corrosion (often occurs at high humidity in the presence of low levels of sulfur-based pollutants). FMMEA can help identify weak subsystems that may have an increased susceptibility to critical failure mechanisms under free air cooling conditions.
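For readers who want to automate the prioritization step, the sketch below shows one hypothetical way to represent FMMEA records and rank them by a risk score. The record fields, the 1–5 scales, and the scores assigned to the example mechanisms are illustrative assumptions, not values from the paper.

from dataclasses import dataclass

@dataclass
class FailureMechanism:
    name: str
    failure_site: str
    driving_load: str
    severity: int    # assumed 1 (low) to 5 (high) scale
    occurrence: int  # assumed 1 (rare) to 5 (frequent) scale

    @property
    def risk(self):
        # Simple risk score: severity times likelihood of occurrence.
        return self.severity * self.occurrence

mechanisms = [
    FailureMechanism("Electrochemical migration", "circuit board", "humidity excursions", 4, 3),
    FailureMechanism("CAF formation", "circuit board laminate", "high humidity", 4, 2),
    FailureMechanism("Creep corrosion", "exposed metallization", "high humidity + sulfur pollutants", 5, 2),
]

for m in sorted(mechanisms, key=lambda m: m.risk, reverse=True):
    print(f"{m.name:30s} risk = {m.risk}")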

FMMEA can be further conducted on the weakest subsystems to identify the critical failure mechanisms at that level and the key parameters that indicate the degradation trends of the system. Under some circumstances, monitoring and data analysis can also be performed for lower-level systems or components [31–33]. Based on the FMMEA results, the parameters of the system and its weakest subsystems/components (e.g., voltage, current, resistance, temperature, impedance) are monitored for risk assessment and mitigation.

In principle, all three PHM approaches (i.e., PoF, data-driven, and fusion) can be used for risk mitigation. The PoF approach is usually not practical for complicated systems with large numbers of subsystems and components. However, the data from monitored parameters allows the use of data-driven PHM at the system level with only a limited need for additional sensing, monitoring, storage, and transmission tools. The data-driven approach detects system anomalies based on system monitoring that covers performance (e.g., uptime, downtime, and quality of service) and other system parameters (e.g., voltage, current, resistance, temperature, humidity, vibration, and acoustic signals). The data-driven approach identifies failure precursor parameters, which are indicative of impending failures, based on system performance and the collected data. Furthermore, the availability of FMMEA and precursor parameter results for lower-level subsystems and components permits the use of the data-driven PHM approach at those levels.

4. A case study on a piece of network equipment

The network architecture in a data center consists of a set of routers and switches whose function is to send data packets to their intended destinations. The network equipment selected for this study was the power adapter of a Zonet ZFS 3015P switch, which is widely used in offices and small enterprises. This hardware was selected for its well-defined and directly observable failure criteria. In this case study, we implemented a data-driven method to detect anomalies in the power adapter, provide early warning of failure, and thereby mitigate the risks. A block diagram of the power adapter is shown in Fig. 4. For the power adapter, the performance parameter is the output voltage, which has a rated value of 9 V. The output voltage drops as the power adapter degrades. We considered the power adapter to have failed when the output voltage dropped more than 10% below the rated value (i.e., below 8.1 V).
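The failure criterion just stated reduces to a simple threshold test. A minimal sketch, with illustrative function and constant names:

RATED_OUTPUT_V = 9.0
FAILURE_THRESHOLD_V = 0.9 * RATED_OUTPUT_V  # 8.1 V, i.e. 10% below the rated output

def adapter_failed(output_v):
    # The adapter is considered failed once its output drops below 8.1 V.
    return output_v < FAILURE_THRESHOLD_V

print(adapter_failed(9.3))  # False: normal operation
print(adapter_failed(2.3))  # True: the permanent failure level observed later in the test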

4.1. Identification of operating conditions

The first step is to identify the operating conditions of the power adapter. Its rated operating conditions are 0–40 °C and 10–90%RH. The operating conditions are set by data center operators and are usually determined by the amount of energy savings expected from the implementation of free air cooling.

Fig. 4. Power adapter of a Zonet ZFS 3015P switch (#1: THX 202H IC; #2: aluminum electrolytic capacitors; #3: resistor; #4: power transformer; #5: output voltage supply).

Fig. 5. Overview of the switch experiment system (environmental chamber containing the switch and its power adapter; Agilent 34970A data acquisition unit recording parameter data; power supply; data packets exchanged between computers; supporting software).

The operating conditions also depend on the outside air temperature, the speeds of the cooling fans, and the cooling load of the data center. In addition, computational fluid dynamics (CFD) or other simulation tools can be used to optimize the operating condition settings. In this case, we assumed the operating conditions were 0–50 °C and 5–95%RH in order to maximize energy savings. We used a condition of 95 °C and 70%RH in the experiment to increase the rate of degradation and observe the failure of a new power adapter within a reasonable time. The power adapter was placed inside an environmental chamber and was in operation for the duration of the experiment. An Agilent 34970A data acquisition unit was used to monitor and record the parameters.


4.2. FMMEA and identification of weak subsystems

The power adapter in this case is a switched-mode power supply (SMPS). FMMEA can help to identify the critical failure mechanisms, the weakest components involved in those mechanisms, and the key parameters that indicate their degradation. According to the FMMEA results in [34], the critical failure mechanisms are aging of the electrolyte, wire melt due to current overload, thermal fatigue, contact mitigation, time-dependent dielectric breakdown, and solder joint fatigue. The components at high risk due to these critical failure mechanisms are the aluminum electrolytic capacitor, the diode, the power metal oxide semiconductor field effect transistor (MOSFET), the transformer, and the integrated circuit (IC).

Fig. 6. Power adapter output voltage over time (permanent drop at 12.7 h).

Fig. 7. The voltage across capacitor 1 over time (small drop at 13.2 h).

4.3. System and weak subsystem monitoring

An overview of the experiment setup is shown in Fig. 5. In the experiment, NetIQ Chariot, a network testing software package, was used to send data packets with sizes up to 10^9 bits continuously from one computer to another through a new switch placed inside an environmental chamber at 90 °C and 70%RH. A new power adapter was used to power the switch. A key step was to monitor the parameter shifts that indicate the degradation trend of the power adapter. With consideration of measurement applicability, several parameters of the power adapter were monitored by an Agilent 34970A data acquisition unit: the voltages across the three capacitors (shown as #2 in Fig. 4) and the output frequency of the THX 202H IC (shown as #1 in Fig. 4). In addition, the output voltage of the power adapter was monitored to track its performance trends (shown as #5 in Fig. 4).

The next step was to identify the failure precursor parameters of the power adapter. During the experiment, the output voltage of the power adapter experienced temporary drops and then returned to its usual level (about 9.3 V), which can be considered intermittent failures. Finally, the output voltage dropped permanently to about 2.3 V after about 12.7 h, and the switch could not operate at this supply voltage, as shown in Fig. 6.

The voltage across capacitor 1 was roughly constant (3.9 V), with a small drop at 13.2 h, as shown in Fig. 7. The voltage across capacitor 2 generally varied between 22 V and 30 V and increased to 36 V after the power adapter failed at 12.7 h, as shown in Fig. 8. The voltage across capacitor 3 varied between 3.2 V and 3.7 V, and stayed at roughly 3.2 V after the power adapter failed at 12.7 h, as shown in Fig. 9. The THX 202H IC frequency started at about 137 kHz and dropped to less than 10 kHz when the power adapter failed at 12.7 h. Before this sharp drop, the frequency exhibited a gradual shift, which indicated degradation of the output voltage. In addition, the frequency also showed some temporary drops corresponding to intermittent failures of the power adapter, as shown in Fig. 10.

Fig. 8. The voltage across capacitor 2 over time (increase to about 36 V after the failure at 12.7 h).

Fig. 9. The voltage across capacitor 3 over time (settling at about 3.2 V after the failure at 12.7 h).

Fig. 10. THX 202H IC frequency over time (anomaly detected at 8 h when the frequency left the µ ± 0.05µ band; permanent failure at 12.7 h).

Table 1. Shifts in the monitored parameters.

Parameter                     Baseline     Final value   Shift
Voltage across capacitor 1    3.9 V        3.5 V         11%
Voltage across capacitor 2    26.2 V       34.1 V        30%
Voltage across capacitor 3    3.5 V        3.2 V         8%
IC frequency                  137.8 kHz    12.1 kHz      91%

Table 2. Correlation coefficients between the monitored parameters and the output voltage.

Parameter                 Capacitor 1 voltage   Capacitor 2 voltage   Capacitor 3 voltage   IC (THX 202H) frequency
Correlation coefficient   0.08                  0.58                  0.55                  0.80


4.4. Anomaly detection

In this case study, the data-driven approach was used for anomaly detection. The next step was to identify the failure precursor parameter. When the power adapter failed at 12.7 h, the monitored parameters shifted, as shown in Table 1. In this table, the baselines of the monitored parameters are the averages of the first 100 data points of the experiment, and the final values are the averages of the first 100 data points after the power adapter failure at 12.7 h. As seen in Table 1, the voltages across capacitors 1 and 3 had similar baselines (3.9 V and 3.5 V) and shifts (11% and 8%), but the voltage across capacitor 2 started at about 26 V and shifted by 30% when the power adapter failed. The THX 202H IC frequency had the largest shift (91%) among the monitored parameters.
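The baseline, final value, and shift figures in Table 1 follow directly from the averaging described above. A minimal sketch, assuming hypothetical NumPy arrays for the time base and one monitored channel:

import numpy as np

def parameter_shift(time_h, signal, failure_time_h=12.7, window=100):
    baseline = signal[:window].mean()                         # average of the first 100 samples
    final = signal[time_h >= failure_time_h][:window].mean()  # first 100 samples after failure
    shift_pct = abs(final - baseline) / abs(baseline) * 100.0
    return baseline, final, shift_pct

# Synthetic stand-in for the IC frequency channel (kHz), only to exercise the function.
t = np.linspace(0.0, 14.0, 1400)
freq = np.where(t < 12.7, 137.8, 12.1)
print(parameter_shift(t, freq))  # approximately (137.8, 12.1, 91%)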

The correlation coefficients between the monitored parameters and the output voltage of the power adapter (the power adapter performance) are shown in Table 2. The IC frequency also has the largest correlation coefficient, 0.80. Based on this analysis, the shift in IC frequency best captures the degradation of the power adapter, and it can be considered the best failure precursor parameter among the monitored parameters.
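The precursor selection by correlation can be reproduced with a Pearson correlation over the monitored time series. A minimal sketch with hypothetical array names; np.corrcoef returns the correlation matrix whose off-diagonal entry is the coefficient of the kind reported in Table 2. In the case study, this ranking would pick the IC frequency (|r| = 0.80).

import numpy as np

def rank_precursors(output_voltage, candidates):
    # candidates: dict mapping parameter name -> time series sampled on the same time base.
    scores = {name: abs(np.corrcoef(output_voltage, series)[0, 1])
              for name, series in candidates.items()}
    for name, r in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:25s} |r| = {r:.2f}")
    return max(scores, key=scores.get)  # parameter most correlated with performance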

The trending of the IC frequency can be used to monitor performance degradation and provide early warning of failure.

The healthy baseline was defined as the mean µ (137 kHz) of the first 100 data points (50 min). The anomaly threshold for the failure precursor parameter is usually selected by engineers, who may make different choices for the same case. Generally, useful information is available to help select an appropriate anomaly threshold, such as the parameter's healthy values, its historical data (e.g., its trend and failure values), and its reference (healthy) range provided in the data sheet. Information about similar products or products from the same technology family is also useful for selecting the anomaly threshold value. As an example, we selected the anomaly threshold as five consecutive data points (2.5 min) deviating more than 5% from µ. With this threshold, the anomaly was detected at 8 h, which was 0.9 h before the first intermittent drop in output voltage and 4.7 h before permanent failure, as shown in Fig. 10.
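The detection rule just described (baseline µ from the first 100 samples, alarm after five consecutive samples outside the µ ± 5% band) can be written compactly. A minimal sketch with hypothetical array names:

import numpy as np

def detect_anomaly(time_h, signal, baseline_samples=100, band=0.05, run_length=5):
    mu = signal[:baseline_samples].mean()       # healthy baseline from the first 100 samples
    outside = np.abs(signal - mu) > band * mu   # samples outside the mu +/- 5% band
    consecutive = 0
    for i, flag in enumerate(outside):
        consecutive = consecutive + 1 if flag else 0
        if consecutive >= run_length:
            return time_h[i], mu                # time of detection and the baseline used
    return None, mu                             # no anomaly detected

Applied to the monitored IC frequency in the case study, this rule fires at 8 h, giving 4.7 h of warning before the permanent failure at 12.7 h.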

5. Conclusions

Prognostics and health management can identify and mitigate the risks to data center equipment arising from free air cooling. The case study presented in this paper shows that the monitoring of equipment parameters can provide early failure warnings in the form of an alarm. When the remaining useful life is estimated using a suitable algorithm, data centers can schedule maintenance or the replacement of equipment to avoid unplanned downtime. This method also helps to identify whether the equipment failure is intermittent or permanent without having to interrupt data center service.

With the requirement for economizers in the 2010 version of ASHRAE Standard 90.1, free air cooling will be increasingly implemented in existing and new data centers. Our prognostics-based method for identifying and mitigating risks can enable the implementation of free air cooling in data centers that were not originally designed for this cooling method. Furthermore, it can enable data centers to perform predictive maintenance (condition-based maintenance) instead of preventive maintenance (routine or time-based maintenance) by providing early warnings of failure, which can reduce both the cost of equipment replacement and service downtime. These benefits would be especially useful for mission-critical data centers.

When the next generation of data center equipment is designed for free air cooling, the primary concern should be the local operating conditions of the electrical parts, since these will be the deciding factor in data center reliability and availability. Toward that goal, PHM can help by gathering valuable life-cycle data. The knowledge of the life cycle and the operating parameters gained by using PHM in data centers will also help risk mitigation at telecommunications base stations that operate unattended at remote locations, where variations in the outside environment are even greater.


Implementing PHM at such installations is critical for improving global data and voice communication networks and delivering the benefits of technological advances to the far reaches of the globe.

Acknowledgement

The authors would like to thank the members of the Prognostics and Health Management Consortium at CALCE for their support of this work.

References

[1] Koomey JG. Growth in data center electricity use 2005–2010. Oakland, CA: Analytics Press; 2011.
[2] Almoli A, Thompson A, Kapur N, Summers J, Thompson H, Hannah G. Computational fluid dynamic investigation of liquid rack cooling in data centres. Appl Energy 2012;89(1):150–5.
[3] Johnson P, Marker T. Data center energy efficiency product profile. Pitt & Sherry, report to Equipment Energy Efficiency Committee (E3) of the Australian Government Department of the Environment, Water, Heritage and the Arts (DEWHA); April 2009.
[4] Intel Information Technology. Reducing data center cost with an air economizer. IT@Intel Brief; computer manufacturing, energy efficiency; August 2008.
[5] Miller R. Microsoft's chiller-less data center. Data Center Knowledge; 2009.
[6] Miller R. Google's chiller-less data center. Data Center Knowledge; July 2009.
[7] European Commission. Code of conduct on data centres energy efficiency, version 2.0; November 2009.
[8] American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). Energy standard for buildings except low-rise residential buildings. Atlanta, GA; October 2010.
[9] Energy Star. Air-side economizer. <http://www.energystar.gov/index.cfm?c=power_mgt.datacenter_efficiency_economizer_airside> [accessed 6.2012].
[10] ASHRAE. 2008 ASHRAE environmental guidelines for datacom equipment. Atlanta; 2008.
[11] Bell Communications Research Inc. Generic requirements GR-63-CORE. Network equipment-building system (NEBS) requirements: physical protection. Piscataway, NJ; March 2006.
[12] Bell Communications Research Inc. Generic requirements GR-3028-CORE. Thermal management in telecommunications central offices. Piscataway, NJ; December 2001.
[13] Bulut H, Aktacir MA. Determination of free cooling potential: a case study for Istanbul, Turkey. Appl Energy 2011;88(3):680–9.
[14] Aktacir MA, Bulut H. Investigation of free cooling potential of Kayseri province. In: Proceedings of 16th national thermal science and technique congress, vol. 2, Kayseri, Turkey; 2007. p. 860–6.
[15] Aktacir MA, Bulut H. Temperature controlled free cooling and energy analysis in all air conditioning systems. In: Proceedings of second national air conditioning congress, Antalya, Turkey; 2007. p. 151–61.
[16] Sorrentino M, Rizzo G, Genova F, Gaspardone M. A model for simulation and optimal energy management of telecom switching plants. Appl Energy 2010;87(1):259–67.
[17] Dovrtel K, Medved S. Weather-predicted control of building free cooling system. Appl Energy 2011;88(9):3088–96.
[18] Homorodi T, Fitch J. Fresh air cooling research. Dell TechCenter; August 2011.
[19] Cheng S, Azarian M, Pecht M. Sensor systems for prognostics and health management. Sensors 2010;10:5774–97.
[20] Moss DL. Data center operating temperature: the sweet spot. Dell Technical White Paper; June 2011.
[21] Shehabi A, Tschudi W, Gadgil A. Data center economizer contamination and humidity study. Emerging technologies program application assessment report to Pacific Gas and Electric Company; March 2007.
[22] American Power Conversion. Battery technology for data centers and network rooms: VRLA reliability and safety. White Paper #39, W. Kingston, RI; 2002.
[23] ASHRAE. Particulate and gaseous contamination in datacom environments. ISBN 9781933742601; 2009.
[24] Pecht M. Prognostics and health management of electronics. New York, NY: Wiley-Interscience; 2008.
[25] Jaai R, Pecht M. Fusion prognostics. In: Proceedings of sixth DSTO international conference on health & usage monitoring, Melbourne, Australia; March 2009.
[26] Lall P, Pecht M, Cushing MJ. A physics-of-failure (PoF) approach to addressing device reliability in accelerated testing. In: 5th European symposium on reliability of electron devices, failure physics and analysis, Glasgow, Scotland; October 1994.
[27] Gu J, Pecht M. Prognostics-based product qualification. In: 2009 IEEE aerospace conference, Big Sky, Montana; March 2009.
[28] Gu J, Pecht M. Physics-of-failure-based prognostics for electronic products. Trans Inst Meas Contr 2009;31(3/4):309–22.
[29] Gu J, Pecht M. Prognostics implementation of electronics under vibration loading. Microelectron Reliab 2007;47(12):1849–56.
[30] Wang W, Azarian M, Pecht M. Qualification for product development. In: 2008 international conference on electronic packaging technology & high density packaging, Shanghai, China; July 2008.
[31] Oh H, Shibutani T, Pecht M. Precursor monitoring approach for reliability assessment of cooling fans. J Intell Manuf; November 2009. http://dx.doi.org/10.1007/s10845-009-0342-2.
[32] Patil N, Celaya J, Das D, Goebel K, Pecht M. Precursor parameter identification for insulated gate bipolar transistor (IGBT) prognostics. IEEE Trans Reliab 2009;58(2).
[33] Kwon D, Azarian MH. Early detection of interconnect degradation by continuous monitoring of RF impedance. IEEE Trans Device Mater Reliab 2009;9(2):296–304.
[34] Mathew S, Alam M, Pecht M. Identification of failure mechanisms to enhance prognostic outcomes. In: MFPT: the applied systems health management conference, Virginia Beach, Virginia; May 10–12, 2011.
[35] ASHRAE TC 9.9. Thermal guidelines for data processing environments; 2004.