
Data Center Controls Reliability

BY JEFF STEIN, P.E., MEMBER ASHRAE; BRANDON GILL, P.E., MEMBER ASHRAE

For high reliability data centers, there is general agreement in the design community on the need for redundancy for certain mechanical equipment (e.g., N+1 pumps, chillers, and cooling units). For other mechanical equipment (e.g., redundant piping), there is little agreement, but at least the options are fairly clear. However, there is far less agreement on, and understanding of, the redundancy requirements and options for the controls components used to monitor and control the mechanical systems.

Controls redundancy may be less visible and less well understood than mechanical redundancy, but it is no less important. In fact, in the authors' experience, more major data center cooling system failures are due to poor controls design and implementation than to mechanical equipment failures. Well-designed control systems must recognize and respond to the possible failure or degradation of any device, including any controller, sensor, actuator, variable frequency drive (VFD), fan, pump, chiller, power source, electrical circuit, and controls communication path. This article discusses how to design and commission data center controls for maximum reliability. Five key areas are addressed:

1. Control system architecture design and associated controlled device configuration;

2. Redundant sensor requirements;

3. Fault responsive sequences of operation;

4. Alarming and notification requirements; and

5. Commissioning.

Controls System Architecture and Controlled Device Configuration

A good data center controls design does not necessarily require controller redundancy. In fact, controller redundancy can reduce reliability due to added complexity and additional points of failure.


Jeff Stein, P.E., is a principal and Brandon Gill, P.E., is an associate at Taylor Engineering in Alameda, Calif. Stein is a member of ASHRAE SSPC 90.4, Energy Standard for Data Centers. Gill is a voting member of ASHRAE TC 1.4, Control Theory and Application.

©ASHRAE www.ashrae.org. Used with permission from ASHRAE Journal at www.taylor-engineering.com. This article may not be copied nor distributed in either paper or digital form without ASHRAE’s permission. For more information about ASHRAE, visit www.ashrae.org.


Twenty years ago, the most common data center cooling design included constant airflow, air-cooled, direct-expansion computer room air-conditioning units (CRACs). One advantage of this design is that it requires little if any centralized control; all CRACs operate independently.

Today, most data center cooling designs are far more efficient and cost-effective but require some form of centralized control to coordinate multiple computer room air-handling units (CRAHs), fans, chillers, pumps, etc. For example, supply fan speeds of multiple CRAHs are typically controlled in unison to maintain a common setpoint, such as underfloor pressure or cold-to-hot-aisle differential pressure (ΔP). The proportional–integral–derivative (PID) loop maintaining ΔP at setpoint runs on a single, central controller, which sends the speed command to all the CRAH units. The CRAH units may rely on the central controller not only for the fan speed command, but also for the start/stop command, supply air temperature setpoint, outside air dry-bulb and wet-bulb temperature (e.g., for economizer operation), etc.
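To make the coordination concrete, here is a minimal sketch of a central pressure loop broadcasting one speed command to every CRAH (Python, purely illustrative; the gains, setpoint, and I/O stubs are assumptions, not values or point names from the article):

```python
# Illustrative only: one central PI loop drives all CRAH supply fans in unison.
import time

KP, KI = 400.0, 40.0           # assumed PI gains (% speed per in. w.c., per in. w.c.*s)
SETPOINT = 0.05                # assumed underfloor pressure setpoint, in. w.c.
MIN_SPD, MAX_SPD = 20.0, 100.0

def read_underfloor_pressure() -> float:
    return 0.045               # stub: would read the hardwired pressure input(s)

def send_speed_command(crah_id: str, speed_pct: float) -> None:
    print(f"{crah_id}: {speed_pct:.1f}%")   # stub: would write to each CRAH controller

def central_pressure_loop(crah_ids, dt=1.0, cycles=3):
    integral, speed = 0.0, MIN_SPD
    for _ in range(cycles):                 # a real controller would loop forever
        error = SETPOINT - read_underfloor_pressure()
        if MIN_SPD < speed < MAX_SPD:       # simple anti-windup: hold integrator when saturated
            integral += error * dt
        speed = max(MIN_SPD, min(MAX_SPD, KP * error + KI * integral))
        for crah in crah_ids:               # the same command goes to every unit
            send_speed_command(crah, speed)
        time.sleep(dt)

central_pressure_loop(["AH-1", "AH-2", "AH-3", "AH-4"])
```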

The controls design must account for the potential failure of the centralized controller; i.e., the loss of a single controller should not result in the loss of data center cooling. The basic options are redundant central controllers or distributed/fail-safe controls. Often, both of these strategies are employed within the same data center. For instance, a central plant may be sequenced by redundant central controllers while the air handlers serving the data hall(s) may achieve fault responsiveness* using distributed fail-safe control. The following sections present design considerations for each option.

*Fault responsiveness means the control system is designed to recognize any fault that can impact uptime (e.g., a chiller failure) and will automatically take necessary corrective action to maximize uptime (e.g., start another chiller).

Redundant Central Controllers

Redundant central controllers are used in both direct digital control (DDC) and programmable logic controller (PLC) designs, but design considerations are unique to each.

DDC: With fully redundant DDC central controllers, all central controller input/output (I/O) points (e.g., temperature sensor inputs, pump start/stop outputs) are wired to a switchover relay panel. In normal operation the I/O are routed to the lead controller. The backup controller monitors the "heartbeat" of the lead controller, using a normally closed binary point that is held open by the lead controller. If the lead controller fails then the heartbeat fails, and the backup controller sends a signal to the switchover panel that energizes the relays and routes all I/O to the backup controller. The backup controller picks up where the lead controller left off and runs the exact same program as the lead controller. This is illustrated in Figure 1.

Some risks of this approach:

• The programmers and operators must be very diligent about exactly mirroring any programming or setpoint changes in both controllers. If any changes are made in the lead controller and are not copied to the backup controller, then the backup controller could immediately shut off equipment that should be running.

• PID loops in the backup controller will wind up or wind down over time, even if loop gains are the same as in the lead controller, due to rounding errors. For example, the lead controller's fan speed PID loop might be at 75% output, but the same PID loop in the backup controller may have wound down to 0% output. Therefore a synchronization algorithm is required such that when the backup controller takes control, the output to the controlled devices is slowly ramped between the last known output from the lead controller and the current output of the backup controller. Alternatively, loops on the lead and backup controller can sometimes be continuously synchronized (e.g., by adjusting the PID loop biases on the lag controller's loops) to yield a seamless transition. A sketch of the ramped handoff appears after this list.

FIGURE 1 Fully redundant DDC central plant controllers: heartbeat signals pass between the "A" and "B" controllers, and switchover relays route the I/O for all plant devices (chillers, pumps, sensors, etc.) to the controller in command.


• Switchover relays add cost and can fail. This risk can be mitigated by having separate inputs hardwired to the backup controller. In normal operation the lead controller uses either only its sensors or both its sensors and the backup controller's sensors. When the lead controller fails, the backup controller uses its dedicated sensors. A general rule in HVAC controls design is that a controller running a PID loop should have all process variables (control point and controlled device) used in that loop hardwired to the controller rather than communicated over the network from another controller. This approach prevents network traffic from affecting loop responsiveness. A PID loop using some points wired to the primary controller and some wired to the backup controller violates that rule, but the risk can be mitigated by minimizing network traffic and using a high-speed network (e.g., Ethernet) between the affected controllers.

• Sensitive equipment, like chillers, can trip during the short interval between when the lead controller fails and the backup takes control.
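The ramped handoff mentioned in the second bullet might look something like the following (Python, illustrative; the ramp time, step interval, and callable names are assumptions rather than a vendor feature):

```python
# Illustrative: after failover, ramp the output seen by the controlled devices
# from the lead controller's last known value to the backup loop's own output.
import time

def ramped_handoff(last_lead_output_pct: float,
                   backup_pid_output,     # callable returning the backup loop's output, %
                   write_output,          # callable that writes the blended output, %
                   ramp_time_s: float = 120.0,
                   step_s: float = 5.0) -> None:
    steps = max(1, int(ramp_time_s / step_s))
    for i in range(1, steps + 1):
        blend = i / steps                  # 0 -> 1 over the ramp period
        output = (1.0 - blend) * last_lead_output_pct + blend * backup_pid_output()
        write_output(output)               # devices see a smooth transition
        time.sleep(step_s)
```

Once the ramp completes, the backup loop's output is used directly; continuously synchronizing the lag controller's loop bias, as noted above, avoids the ramp entirely.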

PLC: Most data centers use commercial-grade DDC controls. However, many data centers instead use PLCs, which are considered "industrial grade" controls and are typically more expensive than DDC.

Figure 2 illustrates a simplified PLC network serving a data center with four air handlers (AH-x) per data hall and a plant with four chillers (CH-x), four chilled water pumps (CHWP-x), four condenser water pumps (CWP-x), and four cooling towers (CT-x). A pair of redundant controllers serves the plant. Each air handler in the data hall is served by its own controller, but there is a pair of redundant central controllers (managers) coordinating common control functions among the individual AHU controllers.

A key difference between PLC and DDC is the configuration of redundant central controllers. PLCs avoid most of the risks that DDC has with redundant central controllers. Rather than being wired directly to a controller, I/O are wired to a remote I/O (RIO) panel. As illustrated in Figure 2, the primary and redundant plant PLC controllers and the RIO reside on the same high-speed network, and both controllers can see the I/O, so switchover relays are not needed. Programming changes to the primary controllers can be automatically duplicated in the redundant controllers. The redundant controller can monitor the PID loops of the primary controller and launch its loops from the same position if the primary controller fails and the redundant controller takes control. Although outside the scope of this article, Figure 2 also illustrates the concept of ring network topology in both the data hall and plant, which allows failure of any one network connection without a break in network communications.

Even if a controls design includes redundant PLC central controllers, it should still be resilient to central control failures, such as a network failure or RIO panel failure; i.e., distributed/fail-safe control should always be considered, even with redundant PLC central controllers.

Note that while PLC has a distinct advantage over DDC where fully redundant central controllers are desired, there are many other factors to consider in choosing a control system. For example, in the authors' experience, the competency of the installing contractor is more important than the product being installed.

Distributed/Fail-Safe Controls

Often distributed/fail-safe control can be used instead of, or in addition to, redundant central control. Only fully redundant central control can guarantee that all I/O will remain online and that full automation of all mechanical equipment will remain available in the event of a central controller failure. However, distributing control functions to multiple controllers with fail-safe logic can allow a data center to satisfactorily weather a partial loss of I/O and partial loss of automation, at least until operators can take any necessary manual intervention. Distributed/fail-safe control can avoid some of the cost and complexity of fully redundant central control.

FIGURE 2 Fully redundant PLC central plant and data hall controllers: redundant plant controllers communicate over a ring network with four remote I/O (RIO) panels, one per CH/CHWP/CWP/CT line-up, while each air handler (AH-1 through AH-4) has its own controller coordinated by redundant data hall managers.


Distributed/fail-safe control can be implemented with DDC or PLC systems. When DDC is used, the authors prefer distributed/fail-safe control over redundant central control because it eliminates the complexity and associated risks of redundant central controllers. In contrast to redundant central control, distributed/fail-safe control does not involve transferring to a standby central controller. Instead, it includes two strategies:

1. Devices are configured to go to a preset position, stay in last position, or revert to local, standalone control during a central controller failure. The fail-safe responses of all devices (variable speed drives, damper and valve actuators, chillers, etc.) must be carefully considered to provide a resilient response in any control power, hardwired control signal, or network failure event.

2. There are also scenarios where no single fail-safe response always applies. In such cases, distributed controllers should be used. Using a cooling tower as an example, failing to a single speed in all circumstances could lead to tower freezing in cold climates. So rather than controlling all cooling tower speeds from a central controller to maintain condenser water supply temperature (CWST) at setpoint, each cooling tower could have its own local controller and associated CWST sensor instead. In normal operation, the central controller would send tower start/stop commands and condenser water setpoints to the local tower controllers. Each tower's leaving water temperature sensor would be wired to its local controller and used by the tower speed control PID loop in the local controller. If communication with the central controller were lost, the local controller would continue to modulate tower speed to maintain a fail-safe tower leaving water temperature setpoint. Should any one tower controller fail, the associated tower could default to a disabled state without creating a freeze risk during winter or placing the plant at risk of losing the load. A sketch of this local fallback appears after this list.
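As an illustration of strategy 2, the local tower controller's fallback might be sketched as below (Python, illustrative only; the heartbeat check, the fail-safe setpoint value, and the stand-in objects are assumptions):

```python
# Illustrative: a local cooling tower controller keeps controlling CWST even if
# the central controller, which normally supplies the setpoint, goes silent.
FAILSAFE_CWST_SETPOINT_F = 75.0   # assumed fail-safe leaving water temperature, °F

def tower_speed_scan(central_link, local_cwst_sensor, pid, vfd):
    """central_link, local_cwst_sensor, pid, and vfd are stand-in objects."""
    if central_link.heartbeat_ok():
        setpoint = central_link.cwst_setpoint()      # normal operation
        enabled = central_link.tower_enabled()
    else:
        setpoint = FAILSAFE_CWST_SETPOINT_F          # comms lost: fail-safe setpoint
        enabled = True                               # keep rejecting heat locally
    if enabled:
        # The leaving water temperature sensor is hardwired to this local controller.
        vfd.write_speed(pid.update(setpoint, local_cwst_sensor.read()))
    else:
        vfd.stop()
```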

Figure 3 shows the same data center from Figure 2, but with a distributed control concept applied. In this configuration, each air handler is still provided with its own controller, but the redundant central data hall controllers (managers) from Figure 2 are eliminated. Similarly, each plant "line up" of devices is provided with its own controller. In both the data hall and plant, one of the controllers acts as the central controller and coordinates common functions across associated devices.

Figure 4 is a more economical variation on distributed control, which the authors have successfully used many times, where the plant devices are split between two controllers rather than a separate controller for each "line up" as shown in Figure 3. In this case, Controller A acts as the central controller and is responsible for staging of all chillers, all pump speeds, etc. If A fails, then all devices go to their fail-safe positions. Controller B does not take over from A, but it can stage on its devices based on its fail-safe logic, and it can be used by the operators to see the status of half of the plant devices, to command those devices on, to see plant leaving water temperature, etc.

One of the main advantages of distributed controllers is that controller redundancy follows device redundancy. For example, if there are N+1 cooling towers with distributed controllers, then there are N+1 distributed controllers. The tower controller is probably more reliable than the devices it controls, so no additional controller redundancy is warranted. Similarly, an air handler should have a dedicated controller, and because there are likely redundant air handlers, there is little benefit to adding redundancy for the distributed controllers themselves.

FIGURE 3 Distributed plant and data hall control: each air handler (AH-1 through AH-4) and each plant line-up (CH/CHWP/CWP/CT-1 through -4) has its own controller, with one controller in the data hall and one in the plant also acting as manager.

FIGURE 4 Distributed control using fewer plant controllers: the plant devices are split between an "A" controller (which also acts as manager, serving CH/CHWP/CWP/CT-1 and -2) and a "B" controller (serving CH/CHWP/CWP/CT-3 and -4), while each air handler keeps its own controller.


The main downside of distributed/fail-safe controllers relative to fully redundant controllers is that full automation is lost during a central controller failure, so there is some risk of not meeting the load if the load changes or if there are more failures while in fail-safe mode. However, IT loads typically do not change quickly, and data centers often are continuously monitored by trained operators who can make necessary manual corrections while in a fail-safe state. The most common options for the fail-safe state are failing to full cooling or failing to last state. The pros and cons of these options are discussed in more detail in the sections below on VFDs, air handlers, chillers, etc.

Note that regardless of whether a device is controlled by a central controller or a distributed controller, it should still be configured optimally for a loss of control signal from its associated controller (or RIO panel). For example, if cooling tower VFDs are controlled by local/dedicated controllers as in the example above, then the safest choice might be to have the VFD fail off if the controller fails. The central controller will know that feedback from the VFD has been lost and therefore can stage on another cooling tower.

The following sections describe how different types of devices can be configured for distributed/fail-safe control. It's worth noting that many of these same device considerations apply to PLC designs with redundant central controllers to address the scenario of a failed RIO panel.

VFDs: Most VFD run commands can be configured to FAIL ON, FAIL OFF, or FAIL LAST if they lose hardwired communication with the controller providing the run command. Each of these configurations is achieved by matching relay configuration to internal drive configuration. FAIL ON means the VFD stays on if already running and automatically starts if it was not already running. FAIL LAST means it stays in the commanded state just before communication failed. Similarly, most VFD speed commands can be configured to either hold their current speed when the speed signal is lost or go to a fail-safe speed. In practice, the drive recognizes a loss of signal when the control input drops below a predefined threshold, e.g., 1 V for a 2 to 10 V control signal input.
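The loss-of-signal test in the last sentence amounts to a simple threshold check; a minimal sketch is below (Python, illustrative; real drives implement this internally, and the threshold and fail-safe values are assumptions):

```python
# Illustrative: classify a 2-10 V speed input and pick a response when it is lost.
LOSS_THRESHOLD_V = 1.0       # below this, treat the 2-10 V input as failed
FAILSAFE_SPEED_PCT = 100.0   # assumed fail-safe speed; 0.0 would mean "fail to stop"

def speed_command_pct(signal_volts: float, last_good_pct: float,
                      mode: str = "failsafe") -> float:
    if signal_volts >= LOSS_THRESHOLD_V:
        # Normal scaling: 2 V -> 0%, 10 V -> 100%
        return max(0.0, min(100.0, (signal_volts - 2.0) / 8.0 * 100.0))
    # Signal lost: respond per the configured mode
    return last_good_pct if mode == "fail_last" else FAILSAFE_SPEED_PCT
```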

In the authors' experience, FAIL LAST is a risky choice, particularly for VFD speed, because the VFD's internal controls do not always recognize the loss of signal quickly enough. If the speed signal goes to zero before the VFD recognizes the loss of communication, then the VFD may think minimum speed is the intended operating point. Furthermore, commissioning FAIL LAST can lead to a false sense of security because it may not be possible to simulate all possible failure scenarios in commissioning. For some devices it is possible to come up with a fail-safe state and speed that are acceptable in all conditions. For example, secondary chilled water pump speed might normally be modulated to maintain chilled water ΔP at setpoint. If the pumps failed ON and failed to 100% speed, then the chilled water coil control valves served by the pumps should be able to maintain control. In some cases, there simply is not a fail-safe state or speed that always works. Going back to the cooling tower example, if a cooling tower VFD failed to 100% speed it could cause ice to form on the tower in the winter.

ECMs: Fail-safe functionality for electronically commutated motors varies based on the onboard controller provided with the motor and must be carefully coordinated with vendors. FAIL ON and FAIL OFF functionality can always be provided through proper relay configuration, e.g., wire the run command through a normally closed relay contact to FAIL ON upon controller failure. One manufacturer the authors have used provides a fail-safe speed response upon loss of an analog speed control signal, like that of a VFD, but does not offer the fail-to-last-known-speed functionality provided with typical VFDs.

Valve and Damper Actuators: Actuators can be configured to fail open or closed upon loss of command signal. Some servomotor actuators for large control valves also allow for a preset fail-safe position upon loss of control signal. Actuators can also be configured to hold last-known position by using floating point control. In effect, any desired command signal fail-safe response can be achieved by specifying the appropriate actuator and associated control wiring.

Actuator fail-safe positions must be coordinated with the fail-safe positions of related devices. For example, if a tower VFD is configured to fail OFF when it loses its speed signal, then the isolation valves serving that tower should also be configured to fail closed upon loss of command signal. Whether controlled by a central controller, redundant central controllers, or a distributed controller, every actuator should be configured for the possibility that it loses a command signal from its controller(s).

In addition to "loss of command signal," actuators also need to be configured for "loss of control power." Basically, the choice here is whether the actuator should have spring return or not. If the actuator does not have spring return, then it will fail in place on loss of control power. If it has spring return, it can be configured to fail open or fail closed on loss of power. A final and less common option is to specify electronic fail-safe actuators that allow actuation to a fail-safe position between full open and full closed upon loss of control power.


Both spring-return and electronic fail-safe actuators cost more than non-spring-return actuators and are also a potential point of failure. For example, suppose a tower isolation valve is spring-closed. If a tower is in operation and loses power to the actuator, then the actuator will shut, taking that tower out of service, even though its controller is still healthy. If the actuator did not have spring return, then the tower would have remained in service. In general, we recommend avoiding spring-return and electronic fail-safe actuators in data center designs unless there is an "airtight" reason to have them. A summary of the available combinations of loss-of-power and loss-of-control-signal fail-safe responses is provided in Table 1. For example, if you wanted a valve to fail open regardless of whether the control signal failed or the control power failed, then it should be a spring-return actuator configured to be full open when the control signal is 0 V and full closed when the signal is 10 V (the cell marked with an asterisk in Table 1).

TABLE 1 Control valve fail-safe selection matrix (row = desired response on loss of control power; column = desired response on loss of control signal).

| Power \ Signal | Fail Open | Fail Closed | Fail Last | Fail-Safe Position |
| Fail Open | Spring return actuator, full closed at 10 V* | Spring return actuator, full closed at 0 V | Spring return actuator with floating point control | – |
| Fail Closed | Spring return actuator, full closed at 10 V | Spring return actuator, full closed at 0 V | Spring return actuator with floating point control | – |
| Fail Last | Non-spring-return actuator, full closed at 10 V | Non-spring-return actuator, full closed at 0 V | Non-spring-return actuator with floating point control | Servomotor actuator with onboard controller |
| Fail-Safe Position | Electronic fail-safe actuator, full closed at 10 V | Electronic fail-safe actuator, full closed at 0 V | Electronic fail-safe actuator with floating point control | – |

*Actuator selection for the example in the text (valve fails open on loss of either signal or power).

Air Handlers: Air-handling units (AHUs) should have a dedicated local controller. Like any distributed controller, an AHU controller must be configured to recognize when it loses communication with the central controller (or the central controller fails) and must control its devices accordingly in such a scenario. A binary network heartbeat can be used to monitor the health of the central controller and communication path. The AHU controller must have a plan for each piece of information that it normally receives from the manager, such as AHU enable/disable, supply fan speed setpoint, outside air temperature, and economizer enable/disable. For example, in normal operation a central controller may monitor several underfloor pressure sensors and modulate all AHU fan speeds together to maintain the lowest valid sensor reading at setpoint. If the AHU controller loses the fan speed setpoint, then it can hold the last-known speed setpoint, or it can modulate its own fan speed if each AHU has an underfloor pressure sensor wired to it.

Just as a distributed controller must know what to do if the central controller fails, the central controller must know what to do if any distributed controller fails or loses communication. For example, if a central plant controller loses communication with an air handler, it does not know whether the air handler is still functioning or has shut down. If the plant sequences include chilled water supply temperature setpoint reset or ΔP setpoint reset requests from the AHU, then the safest approach is probably to assume the AHU is still functioning and needs the coldest water and highest pressure possible.
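A minimal sketch of the AHU-side fallback decision (Python, illustrative; the heartbeat test, point names, and the optional local pressure sensor are assumptions layered on the behavior described above):

```python
# Illustrative: each scan, the AHU controller decides where its fan speed comes from.
def ahu_fan_speed_pct(central, local_pressure_sensor, local_pid,
                      last_network_speed_pct: float) -> float:
    """central, local_pressure_sensor, and local_pid are stand-in objects."""
    if central.heartbeat_ok():
        # Normal operation: use the speed command from the central pressure loop.
        return central.fan_speed_command_pct()
    if local_pressure_sensor is not None and local_pressure_sensor.is_valid():
        # Fallback 1: modulate on the AHU's own hardwired underfloor pressure sensor.
        return local_pid.update(setpoint=0.05, measurement=local_pressure_sensor.read())
    # Fallback 2: hold the last speed received over the network.
    return last_network_speed_pct
```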

Chillers: Critical chiller points, such as start/stop, status, and chilled water supply temperature (CHWST) setpoint, should be hardwired points rather than network points. A hardwired start/stop can be configured to FAIL ON, FAIL OFF, or FAIL LAST on loss of external command. FAIL ON may seem like the safest option, but suddenly bringing on chillers can trip chillers that are already operating and make all chillers unstable, particularly if load is very low. FAIL LAST is typically the best option but requires an additional controller digital output and relay logic, as illustrated in Figure 5.




Chillers can also typically be configured to maintain a safe CHWST setpoint on loss of the external setpoint. Because chillers can be configured to hold the last-known run command and default to a fail-safe CHWST setpoint, dedicated DDC or PLC controllers per chiller are not typically necessary; i.e., the chiller's internal controller is sufficiently fail-safe.

FIGURE 5 Chiller hold last command relay logic.

Sensors

While redundant controllers can add unwarranted complexity, redundant sensors add little complexity and are highly recommended for critical control inputs. Critical sensors include those used for control of centralized control variables, such as chilled water temperature, chilled water differential pressure, underfloor plenum pressure, and space pressure, as well as some sensors used for distributed/local control functions, such as air handler supply air temperature (SAT). Many engineers assume that because AHUs are redundant, their sensors do not need to be redundant, but a faulty SAT sensor can be worse than a failed AHU because it can unknowingly cause an AHU to provide unacceptably hot or cold air to a critical IT space.

Most sensors are not critical sensors. An air handler might have sensors for SAT, return air temperature, outside air temperature, mixed air temperature, damper feedback, valve feedback, fan status, etc. The SAT is "where the rubber hits the road" and is probably the only sensor that deserves redundancy. Failures of other sensors do not carry the same risk. Failure of an OAT sensor, for example, might cost some extra chiller energy by not economizing at the right time, but it is unlikely to prevent the AHU from achieving SAT setpoint. In fact, having two SAT sensors is probably more important than having any outside, return, or mixed air sensors in an AHU. AHUs can share sensors with other AHUs or do without them. For example, if the AHU uses direct evaporative cooling without compressor cooling, then the only temperature sensor that the AHU needs is SAT. See "Disparity Alarms" below for more discussion of redundant sensors.

Sequences

Reliability can be improved by avoiding sequences that rely on less reliable sensors, like humidity sensors or water flow sensors. For example, suppose condenser water pump speed needs to be high enough to maintain at least the minimum flow rate required by the cooling towers. A condenser water flow meter or ΔP sensors across the chiller barrels could be used in the sequence. However, in this case the most reliable option is to use open loop control rather than closed loop control: during commissioning, have the balancer determine the minimum pump speed for every combination of pumps, towers, and chillers, and then include that minimum speed in the control sequences. In general, the fewer sensors that are required for normal operation, the better.
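The open-loop approach boils down to a commissioning-derived lookup; a minimal sketch follows (Python; the speeds shown are invented placeholders, not real balancing data):

```python
# Illustrative: balancer-measured minimum condenser water pump speeds (%),
# keyed by (pumps running, towers active, chillers running).
MIN_CW_PUMP_SPEED_PCT = {
    (1, 1, 1): 55.0,
    (2, 2, 1): 42.0,
    (2, 2, 2): 60.0,
    (3, 3, 2): 48.0,
}

def min_pump_speed(pumps_on: int, towers_on: int, chillers_on: int) -> float:
    # Fall back to a conservatively high speed for any untested combination.
    return MIN_CW_PUMP_SPEED_PCT.get((pumps_on, towers_on, chillers_on), 100.0)
```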

Sequences should also make liberal use of "belts and suspenders" to mitigate the risk of sensor error. A good example is chiller staging. Adjusting the number of enabled chillers based on measured chiller load is a good way to stage chillers to maximize efficiency, but it should not be the only way to stage chillers, because the flow or temperature sensors used to measure the load can fail or go out of calibration. Therefore, chillers should also be staged up on failure to maintain the chilled water supply temperature setpoint, excessive chiller water pressure drop, and high chiller percent rated load amps.
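A condensed sketch of such belt-and-suspenders stage-up logic (Python, illustrative; the thresholds and argument names are assumptions, not design values):

```python
# Illustrative: stage up if ANY independent trigger fires, so that one failed
# load sensor cannot block staging.
def should_stage_up(load_tons: float, stage_up_load_tons: float,
                    chwst_f: float, chwst_setpoint_f: float,
                    chiller_dp_psi: float, max_chiller_dp_psi: float,
                    pct_rla: float) -> bool:
    triggers = [
        load_tons > stage_up_load_tons,           # efficiency-based staging on measured load
        chwst_f > chwst_setpoint_f + 2.0,         # not holding chilled water supply temperature
        chiller_dp_psi > max_chiller_dp_psi,      # excessive water pressure drop across the chiller
        pct_rla > 90.0,                           # chiller near its rated load amps
    ]
    return any(triggers)
```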

Perhaps the most critical function of a data center sequence of operations is recognizing and responding to failures. For example, when do you consider a chiller to be in alarm and therefore shut it off and start another chiller? A chiller plant will likely have redundant chillers and may not be near design load, so it makes sense to be conservative and consider a chiller in alarm not only when the chiller sends an alarm signal, but also when the chiller status is off unexpectedly, when the leaving chilled water temperature is too far above setpoint, etc. It gets trickier, however, if there are no other chillers left to start that are not already in alarm. In this extreme case, the sequence should probably enable all chillers until there are enough alarm-free chillers to meet the current chiller stage. Of course, all scenarios should be individually analyzed, and solutions will be specific to both the mechanical and electrical design for the plant.
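The "consider a chiller in alarm" test could be sketched as follows (Python, illustrative; the thresholds, grace periods, and attribute names are assumptions):

```python
# Illustrative: a chiller is treated as failed if any independent symptom persists,
# not only when the chiller reports its own alarm.
def chiller_in_alarm(ch, chwst_setpoint_f: float) -> bool:
    """ch is a stand-in object exposing the monitored chiller points."""
    return (
        ch.alarm_contact_active()                                # chiller says it faulted
        or (ch.commanded_on() and not ch.status_on()
            and ch.seconds_since_command() > 120)                # status off unexpectedly
        or (ch.status_on()
            and ch.leaving_chw_temp_f() > chwst_setpoint_f + 3.0
            and ch.seconds_above_setpoint() > 600)               # cannot hold setpoint
    )
```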



Sequences must also cover another critical concern: restoring mechanical equipment as quickly as possible when a site loses utility power and switches to generator power.1

Sequences must also anticipate sensor and device failures. If a SAT sensor fails and there is a redundant one, then it can be used. But what if the redundant sensor also fails? Then should the chilled water valve or economizer dampers stay in last position, go to a fail-safe position, or should the AHU shut down?

The risk of device failures can be mitigated by providing for rolling redundancy wherever possible in sequences. Going back to the chiller staging example, in a four-chiller plant, load staging points for switching from two to three chillers and from three to four chillers would be chosen to maximize efficiency. The one-to-two chiller staging point, however, would be set at the lowest load that would allow two chillers to operate stably without cycling. This approach mitigates the risk of losing the load during a single chiller failure. A similar strategy should be employed for both condenser water and chilled water pump staging to minimize the risk of chiller trips.

Alarms

Alarms are critical not only for alerting operators to failures but also for identifying degradations in performance and potential problems as early as possible.

Alarms should be tuned just like PID loops. Alarms should trigger just outside of normal operation. For example, a high SAT alarm should not be set at 2°F (1.1°C) above SAT setpoint if SAT fluctuates ±3°F (1.7°C) around setpoint in normal operation. Of course, setting alarms too loosely can cost operators (and alarm response sequences) precious time in responding to critical situations. Some tuning may be possible during pre-occupancy commissioning, but often it is not possible to properly tune alarms until post-occupancy commissioning.

Nuisance alarms are a common problem in some data centers. When there are hundreds or thousands of alarms every day, the operators have little choice but to assume that most or all alarms are nuisance alarms. Some techniques to avoid nuisance alarms, besides alarm tuning, are:

• Levels: All alarms should be classified into at least three to five levels, such as fatal, critical, warning, notification, maintenance reminder, etc.

• Entry Delays: All alarms should have an adjustable delay time such that the alarm is not triggered unless the alarm condition is true for the delay time.

• Exit Deadband: All alarms on analog inputs should have an adjustable deadband or hysteresis; e.g., if the SAT alarm is set at 85°F (29°C) for 5 minutes, then the alarm does not restore to normal until the SAT drops below the alarm setpoint minus a deadband of 3°F (1.7°C) for 5 minutes. (A sketch of entry delay and exit deadband logic follows this list.)

• Suppression Period: A particular instance of an alarm should be prohibited from recurring for a defined suppression period. For example, if communication from a controller is fading in and out every few minutes, then one alarm a day is sufficient rather than one alarm every few minutes.

• Hierarchical and Maintenance Mode Suppression: An alarm should be suppressed if it is associated with a device that has been taken offline for maintenance, or if its hierarchical alarm(s) is active; e.g., CRAH VFD failure may be listed as the hierarchical alarm for the CRAH SAT alarm, such that if the CRAH VFD failure is active then the CRAH SAT alarm is suppressed. An alarm may have several hierarchical alarms that suppress it. If an alarm's hierarchical alarm is itself suppressed, then the alarm is suppressed; e.g., loss of power would suppress CRAH VFD failure, which would in turn suppress the CRAH SAT alarm.
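As referenced in the Exit Deadband item, the entry delay and hysteresis can be combined in a small state machine; a minimal sketch (Python; the limit, deadband, and delay are the example values from the text, while the class itself is an assumption):

```python
# Illustrative: analog high-limit alarm with an entry delay and an exit deadband.
class HighLimitAlarm:
    def __init__(self, limit=85.0, deadband=3.0, delay_s=300.0):
        self.limit, self.deadband, self.delay_s = limit, deadband, delay_s
        self.active = False
        self._timer = 0.0          # seconds the entry (or exit) condition has been true

    def update(self, value: float, dt_s: float) -> bool:
        if not self.active:
            # Entry: value must exceed the limit continuously for delay_s.
            self._timer = self._timer + dt_s if value > self.limit else 0.0
            if self._timer >= self.delay_s:
                self.active, self._timer = True, 0.0
        else:
            # Exit: value must stay below (limit - deadband) continuously for delay_s.
            self._timer = self._timer + dt_s if value < self.limit - self.deadband else 0.0
            if self._timer >= self.delay_s:
                self.active, self._timer = False, 0.0
        return self.active
```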

Disparity Alarms

Disparity alarms alert the operators to possible sensor drift by comparing two or more measured or calculated values that should closely agree. They are particularly useful for sensors that are difficult to calibrate and keep calibrated, such as humidity sensors and air or water flow sensors. Obviously, redundant SAT sensors should always read about the same.


Other disparity alarms may be triggered by comparing:

• Chiller temperatures, flows, and pressure drops to other chillers and to the plant total readings;

• Btu meters on both sides of waterside economizer heat exchangers; and

• Supply air dew point of active air handlers, when the air handlers are not adding or removing humidity.

When three or more sensors feed into disparity alarm logic, e.g., for site outside air temperature sensors, it is possible to use the disparity alarm logic to disqualify clearly inaccurate sensors from control functions until they are released by operators.
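One common way to implement the three-or-more-sensor case is median selection with outlier disqualification; a rough sketch follows (Python, illustrative; the 5°F disparity limit is an assumed value):

```python
# Illustrative: pick a control value from 3+ redundant sensors and hold out any
# sensor that disagrees with the median by more than a disparity limit.
import statistics

DISPARITY_LIMIT_F = 5.0   # assumed allowable disagreement for site OAT sensors

def select_and_flag(readings: dict, disqualified: set):
    valid = {k: v for k, v in readings.items() if k not in disqualified}
    if not valid:
        return None, disqualified            # no usable sensor; the sequence must handle this
    median = statistics.median(valid.values())
    for name, value in valid.items():
        if abs(value - median) > DISPARITY_LIMIT_F:
            disqualified.add(name)           # held out of control use until released by operators
    remaining = [v for k, v in valid.items() if k not in disqualified]
    return (statistics.median(remaining) if remaining else median), disqualified
```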

Commissioning

Probably the two most critical commissioning tasks are functional testing before occupancy and trend reviews after occupancy. Functional testing should include simulating every possible failure scenario in the most realistic manner possible and confirming that the controls react accordingly. Overriding the status feedback from a fan, for example, is not as realistic as pulling the disconnect when the fan is running. Functional testing should also include capacity testing equipment to ensure devices and systems are able to perform at maximum and minimum design capacity.

While failure scenarios can typically be simulated reasonably well, it is often not possible to accurately simulate how smoothly the mechanical and controls design will perform in normal operation when subjected to real loads. Functional testing can confirm that sequences are programmed correctly, but it typically cannot confirm that loops are tuned or identify flawed sequences in need of minor or major revisions. Therefore, trends must be reviewed after occupancy to confirm that all systems are in fact operating per the sequences, that all control loops are properly tuned to avoid instability/hunting, and that the alarms, sequences, and setpoints are in fact the best ones possible. Often minor adjustments to sequences and setpoints are needed to improve stability and reliability and to mitigate nuisance alarms. Usually multiple rounds of trend reviews are required to validate sequences that depend on different load and weather conditions. Trend review includes reviewing alarm logs to mitigate nuisance alarms as described above.

If any significant changes are to be made to the controls after occupancy, the revised sequences should first be tested on a simulator before being installed and functionally tested on the live system. Simulator testing generally involves uploading the updated control program onto a controller in the contractor's shop and verifying that the expected controller outputs are triggered in response to overridden sensor input values.

Over the life of a data center, the load changes, the mechanical equipment ages, sequences get tweaked, setpoints get changed, nuisance alarms get disabled, etc. Therefore, both functional testing and trend reviews should be performed periodically throughout the life of a data center to ensure the controls and equipment are still sufficiently reliable at all times. For example, if the controls design includes fully redundant DDC central controllers, then it is important to regularly fail the lead controller to verify that the backup controller picks up where the lead controller left off and runs the exact same program as the lead controller.

Conclusion

Data center controls are inherently more complex than the individual mechanical components they serve. As such, careful design and commissioning of controls is essential to ensure data center fault responsiveness. A fault responsive control system architecture with well-thought-out device configuration is critical to ensure proper controls response following the failure of any control device. Smart sequences of operation realize the redundancy afforded by well-designed network architecture and device configuration. Specification of redundant sensors for critical applications safeguards against one failed sensor torpedoing the redundancy provided by smart sequences. Properly specified and tuned alarms provide operators feedback they can rely on to identify either a real failure or degraded system performance that will ultimately cause one. Lastly, a thorough commissioning process emphasizing functional testing and post-occupancy trend review ensures that the fault responsiveness afforded by all aspects of the controls system design is realized.

References

1. Hydeman, M., R. Seidl, C. Shalley. 2005. "Staying on-line: data center commissioning." ASHRAE Journal (4).
