Data Center Controls Reliability
BY JEFF STEIN, P.E., MEMBER ASHRAE; BRANDON GILL, P.E., MEMBER ASHRAE
For high reliability data centers, there is general agreement in the design community on the need for redundancy for certain mechanical equipment (e.g., N+1 pumps, chillers, and cooling units). For other mechanical equipment (e.g., redundant piping), there is little agreement, but at least the options are fairly clear. However, there is far less agreement on and understanding of the redundancy requirements and options for the controls components used to monitor and control the mechanical systems.
Controls redundancy may be less visible and less well
understood than mechanical redundancy, but it is no
less important. In fact, in the authors’ experience, more
major data center cooling system failures are due to
poor controls design and implementation than are due
to mechanical equipment failures. Well-designed con-
trol systems must recognize and respond to the possible
failure or degradation of any device, including any con-
troller, sensor, actuator, variable frequency drive (VFD),
fan, pump, chiller, power source, electrical circuit, and
controls communication path. This article discusses
how to design and commission data center controls for
maximum reliability. Five key areas are addressed:
1. Control system architecture design and associated
controlled device configuration;
2. Redundant sensor requirements;
3. Fault responsive sequences of operation;
4. Alarming and notification requirements; and
5. Commissioning.
Controls System Architecture and Controlled Device Configuration
A good data center controls design does not necessar-
ily require controller redundancy. In fact, controller
redundancy can reduce reliability due to added com-
plexity and additional points of failure.
Jeff Stein, P.E., is a principal and Brandon Gill, P.E., is an associate at Taylor Engineering in Alameda, Calif. Stein is a member of ASHRAE SSPC 90.4, Energy Standard for Data Centers. Gill is a voting member of ASHRAE TC 1.4, Control Theory and Application.
©ASHRAE www.ashrae.org. Used with permission from ASHRAE Journal at www.taylor-engineering.com. This article may not be copied nor distributed in either paper or digital form without ASHRAE’s permission. For more information about ASHRAE, visit www.ashrae.org.
Twenty years ago, the most common data center
cooling design included constant airflow, air-cooled,
direct-expansion, computer room air-conditioning
units (CRACs). One advantage of this design is that it
requires little if any centralized control; all CRACs oper-
ate independently.
Today, most data center cooling designs are far more
efficient and cost-effective but require some form of
centralized control to coordinate multiple computer
room air-handling units (CRAHs), fans, chillers, pumps,
etc. For example, supply fan speeds of multiple CRAHs
are typically controlled in unison to maintain a common
setpoint, such as underfloor pressure or cold to hot aisle
differential pressure (ΔP). The proportional–integral–
derivative (PID) loop maintaining ΔP at setpoint runs on
a single, central controller, which sends the speed com-
mand to all the CRAH units. The CRAH units may rely on
the central controller not only for fan speed command,
but also for start/stop command, supply air temperature
setpoint, outside air dry-bulb and wet-bulb temperature
(e.g., for economizer operation), etc.
The controls design must account for the potential
failure of the centralized controller, i.e. the loss of a
single controller should not result in the loss of data
center cooling. The basic options are redundant central
controllers or distributed/fail-safe controls. Often, both
of these strategies are employed within the same data
center. For instance, a central plant may be sequenced
by redundant central controllers while the air handlers
serving the data hall(s) may achieve fault responsive-
ness* using distributed fail-safe control. The following
sections present design considerations for each option.
Redundant Central Controllers
Redundant central controllers are used in both direct
digital control (DDC) and programmable logic controller
(PLC) designs, but design considerations are unique to
each.
DDC
With fully redundant DDC central controllers,
all central controller inputs/output (I/O) points (e.g.,
temperature sensor inputs, pump start/stop outputs)
are wired to a switchover relay panel. In normal opera-
tion the I/O are routed to the lead controller. The backup
controller monitors the “heartbeat” of the lead control-
ler, using a normally closed binary point that is held
open by the lead controller. If the lead controller fails
then the heartbeat fails, and the backup controller sends
a signal to the switchover panel that energizes the relays
and routes all I/O to the backup controller. The backup
controller picks up where the lead controller left off and
runs the exact same program as the lead controller. This
is illustrated in Figure 1.
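As a concrete illustration of this handoff, a minimal sketch of the backup controller's watchdog logic follows, assuming the heartbeat is read through a binary input and the switchover relay panel is energized through a binary output. The 10 s timeout and the two I/O callables are illustrative assumptions, not features of any particular DDC product.

```python
import time

# Assumed timeout before the backup declares the lead controller failed.
HEARTBEAT_TIMEOUT_S = 10.0

class BackupController:
    """Backup DDC controller that watches the lead controller's heartbeat
    (a normally closed point held open by the lead) and energizes the
    switchover relay panel when the heartbeat is lost."""

    def __init__(self, read_heartbeat_input, energize_switchover_relays):
        # Both arguments are hypothetical I/O bindings supplied by the
        # integrator: the first returns True while the lead holds the
        # heartbeat point open; the second routes all I/O to this controller.
        self._heartbeat_ok = read_heartbeat_input
        self._take_io = energize_switchover_relays
        self._last_ok = time.monotonic()
        self.in_control = False

    def poll(self):
        """Call periodically (e.g., once per second) from the backup's task loop."""
        now = time.monotonic()
        if self._heartbeat_ok():
            self._last_ok = now                      # lead controller healthy
        elif not self.in_control and now - self._last_ok > HEARTBEAT_TIMEOUT_S:
            self._take_io()                          # energize switchover relays
            self.in_control = True                   # backup now runs the same program
```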
Some risks of this approach:
• The programmers and operators must be very
diligent about exactly mirroring any programming or
setpoint changes in both controllers. If any changes are
made in the lead controller and are not copied to the
backup controller, then the backup controller could im-
mediately shut off equipment that should be running.
• PID loops in the backup controller will wind up or wind down over time due to rounding errors, even if loop gains are the same as in the lead controller. For
example, the lead controller’s fan speed PID loop might
be at 75% output but the same PID loop in the backup
controller may have wound down to 0% output. There-
fore a synchronization algorithm is required such that
when the backup controller takes control the output to
the controlled devices is slowly ramped between the last
known output from the lead controller and the current
output of the backup controller. Alternatively, loops on
the lead and backup controller can sometimes be con-
tinuously synchronized—e.g., by adjusting the PID loop
biases on the lag controller’s loops—to yield a seamless
transition (a sketch of the takeover ramp follows this list of risks).
FIGURE 1 Fully redundant DDC central plant controllers.
*Fault responsiveness means the control system is designed to recognize any fault that can impact uptime (e.g., a chiller failure) and will automatically take necessary corrective action to maximize uptime (e.g., start another chiller).
• Switchover relays add cost and can fail. This risk can
be mitigated by having separate inputs hardwired to the
backup controller. In normal operation the lead control-
ler uses either only its sensors or both its sensors and the
backup controller’s sensors. When the lead controller
fails, the backup controller uses its dedicated sensors. A
general rule in HVAC controls design is that a control-
ler running a PID loop should have all process variables
(control point and controlled device) used in that loop
hardwired to the controller rather than communicated
over the network from another controller. This approach
prevents network traffic from affecting loop responsive-
ness. A PID loop using some points wired to the primary
controller and some wired to the backup controller
violates that rule but the risk can be mitigated by mini-
mizing network traffic and using a high-speed network
(e.g., Ethernet) between the affected controllers.
• Sensitive equipment, like chillers, can trip during
the short interval between when the lead controller fails
and the backup takes control.
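To illustrate the synchronization issue noted above, the sketch below blends the command sent to the controlled devices from the lead controller's last known output toward the backup controller's own loop output over a fixed ramp time. The function and the 300-second ramp are assumptions for illustration only.

```python
def ramped_output(last_lead_output_pct, backup_loop_output_pct,
                  seconds_since_takeover, ramp_time_s=300.0):
    """Blend from the lead controller's last known output to the backup
    controller's own PID output over ramp_time_s so the transfer is bumpless."""
    frac = min(max(seconds_since_takeover / ramp_time_s, 0.0), 1.0)
    return (1.0 - frac) * last_lead_output_pct + frac * backup_loop_output_pct

# 60 s after takeover: lead was last at 75%, backup loop has wound down to 0%.
print(ramped_output(75.0, 0.0, 60.0))  # 60.0 -- still close to the lead's last output
```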
PLC
Most data centers use commercial grade DDC
controls. However, many data centers instead use PLCs,
which are considered “industrial grade” controls, and
are typically more expensive than DDC.
Figure 2 illustrates a simplified PLC network serving
a data center with four air handlers (AH-x) per data
hall and a plant with four chillers (CH-x), four chilled
water pumps (CHWP-x), four condenser water pumps
(CWP-x) and four cooling towers (CT-x). A pair of redun-
dant controllers serves the plant. Each air handler in the
data hall is served by its own controller, but there is a
pair of redundant central controllers (managers) coordi-
nating common control functions among the individual
AHU controllers.
A key difference between PLC and DDC is configura-
tion of redundant central controllers. PLCs avoid most of
the risks that DDC has with redundant central control-
lers. Rather than being wired directly to a controller, I/O
are wired to a remote I/O (RIO) panel. As illustrated in
Figure 2, the primary and redundant plant PLC control-
lers and the RIO reside on the same high-speed net-
work and both controllers can see the I/O so switchover
relays are not needed. Programming changes to the
primary controllers can be automatically duplicated in
the redundant controllers. The redundant controller
can monitor the PID loops of the primary controller and
launch its loops from the same position if the primary
controller fails and the redundant controller takes con-
trol. Although outside the scope of this article, Figure 2
also illustrates the concept of ring network topology in
both the data hall and plant, which allows failure of any
one network connection without a break in network
communications.
Even if a controls design includes redundant PLC
central controllers, it should still be resilient to cen-
tral control failures, such as a network failure or RIO
panel failure, i.e., distributed/fail-safe control should
always be considered, even with redundant PLC central
controllers.
Note that while PLC has a distinct advantage over DDC
where fully redundant central controllers are desired,
there are many other factors to consider in choosing a
control system. For example, in the authors’ experience,
the competency of the installing contractor is more
important than the product being installed.
Distributed/Fail-Safe Controls
Often, distributed/fail-safe control can be used
instead of, or in addition to, redundant central control.
Only fully redundant central control can guarantee that all I/O will remain online and that full automation of all mechanical equipment will remain available in the event of a central controller failure. However, distributing
control functions to multiple controllers with fail-safe
logic can allow a data center to satisfactorily weather
a partial loss of I/O and partial loss of automation, at
least until operators can take any necessary manual
intervention. Distributed/fail-safe control can avoid
some of the cost and complexity of fully redundant
central control.
FIGURE 2 Fully redundant PLC central plant and data hall controllers.
Distributed/fail-safe control can be implemented with
DDC or PLC systems. When DDC is used, the authors
prefer distributed/fail-safe control over redundant cen-
tral control because it eliminates the complexity and
associated risks of redundant central controllers. In con-
trast to redundant central control, distributed/fail-safe
control does not involve transferring to a standby central
controller. Instead, it includes two strategies:
1. Devices are configured to go to a preset position,
stay in last position, or revert to local, standalone control
during a central controller failure. The fail-safe re-
sponses of all devices—variable speed drives, damper
and valve actuators, chillers, etc.—must be carefully
considered to provide a resilient response in any control
power, hardwired control signal, or network failure
event.
2. There are also scenarios in which no single fail-safe response is appropriate in all circumstances. In such cases, distributed controllers should be used. Using a cooling tower as an example, failing to a fixed speed could lead to tower freezing in cold climates. So rather
than controlling all cooling tower speeds from a
central controller to maintain condenser water supply
temperature (CWST) at setpoint, each cooling tower
could have its own local controller and associated
CWST sensor instead. In normal operation, the central
controller would send tower start/stop and condenser
water setpoints to the local tower controllers. Each
tower’s leaving water temperature sensor would be
wired to its local controller and used by the tower
speed control PID loop in the local controller. If
communication with the central controller were lost,
the local controller would continue to modulate tower
speed to maintain a fail-safe tower leaving water
temperature setpoint. Should any one tower controller
fail, the associated tower could default to a disabled
state without creating a freeze risk during winter or
placing the plant at risk of losing the load.
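Continuing the cooling tower example, the sketch below shows how a local tower controller might revert to a locally stored fail-safe leaving water temperature setpoint when the central controller stops communicating. The timeout, setpoint, proportional-only speed loop, and point names are simplifying assumptions for illustration.

```python
# Assumed values for illustration only.
COMM_TIMEOUT_S = 60.0            # how long before the central controller is presumed lost
FAILSAFE_CWST_SETPOINT_F = 80.0  # locally stored fail-safe leaving water setpoint

def tower_fan_speed(local_leaving_water_temp_f, setpoint_from_central_f,
                    seconds_since_last_central_msg,
                    kp=5.0, min_speed_pct=20.0, max_speed_pct=100.0):
    """Return a cooling tower fan speed command (%) from the tower's own
    controller, using the hardwired leaving water temperature sensor.
    Falls back to the local fail-safe setpoint if the central controller
    stops communicating. Staging (on/off) is assumed to be handled separately."""
    if seconds_since_last_central_msg > COMM_TIMEOUT_S:
        setpoint_f = FAILSAFE_CWST_SETPOINT_F        # central controller lost
    else:
        setpoint_f = setpoint_from_central_f         # normal operation
    error_f = local_leaving_water_temp_f - setpoint_f
    return min(max(kp * error_f, min_speed_pct), max_speed_pct)

print(tower_fan_speed(85.0, 78.0, 5.0))    # 35.0: normal operation, central setpoint
print(tower_fan_speed(85.0, 78.0, 300.0))  # 25.0: comm lost, local fail-safe setpoint
```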
Figure 3 shows the same data center from Figure 2, but
with a distributed control concept applied. In this con-
figuration, each air handler is still provided with its own
controller, but the redundant central data hall control-
lers (managers) from Figure 2 are eliminated. Similarly,
each plant “line up” of devices is provided with its own
controller. In both the data hall and plant, one of the
controllers acts as the central controller and coordinates
common functions across associated devices.
Figure 4 is a more economical variation on distrib-
uted control, which the authors have successfully
used many times, in which the plant devices are split between two controllers, rather than a separate con-
troller for each “line up” as shown in Figure 3. In this
case, Controller A acts as the central controller and is
responsible for staging of all chillers, all pump speeds,
etc. If A fails then all devices go to their fail-safe posi-
tions. Controller B does not take over from A but it can
stage on its devices based on its fail-safe logic and it
can be used by the operators to see the status of half of
the plant devices, to command those devices on, to see
plant leaving water temperature, etc.
One of the main advantages of distributed controllers
is that controller redundancy follows device redun-
dancy. For example, if there are N+1 cooling towers with
distributed controllers then there are N+1 distributed
controllers. The tower controller is probably more reli-
able than the devices it controls so no additional control-
ler redundancy is warranted. Similarly, each air handler should have a dedicated controller; because the air handlers themselves are typically redundant, there is little benefit to making their distributed controllers redundant.
FIGURE 3 Distributed plant and data hall control.
FIGURE 4 Distributed control using fewer plant controllers.
The main downside of distributed/fail-safe control-
lers relative to fully redundant controllers is that full
automation is lost during a central controller failure
so there is some risk of not meeting the load if the load
changes or if there are more failures while in fail-safe
mode. However, IT loads typically do not change quickly
and data centers often are continuously monitored by
trained operators who can make necessary manual cor-
rections while in a fail-safe state. The most common
options for fail-safe state are failing to full cooling or
failing to last state. The pros and cons of these options
are discussed in more detail in the sections below on
VFDs, air handlers, chillers, etc.
Note that regardless of whether a device is controlled
by a central controller or a distributed controller it
should still be configured optimally for a loss of control
signal from its associated controller (or RIO panel). For
example, if cooling tower VFDs are controlled by local/
dedicated controllers as in the example above, then the
safest choice might be to have the VFD fail off if the con-
troller fails. The central controller will know that feed-
back from the VFD has been lost and therefore can stage
on another cooling tower.
The following sections describe how different types
of devices can be configured for distributed/fail-safe
control. It’s worth noting that many of these same device
considerations apply to PLC designs with redundant
central controllers to address the scenario of a failed RIO
panel.
VFDs
Most VFD run commands can be configured to
FAIL ON, FAIL OFF or to FAIL LAST if they lose hardwired
communication with the controller providing the run
command. Each of these configurations is achieved by
matching relay configuration to internal drive configu-
ration. FAIL ON means the VFD stays on if already run-
ning and automatically starts if it was not already run-
ning. FAIL LAST means it stays in the commanded state
just before communication failed. Similarly, most VFD
speed commands can be configured to either hold their
current speed when the speed signal is lost or go to a
fail-safe speed. In practice, the drive recognizes a loss of
signal when the control input drops below a pre-defined
threshold, e.g., 1V for a 2 to 10V control signal input.
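The sketch below illustrates that loss-of-signal interpretation for a 2 to 10V speed input: readings below the threshold are treated as a lost signal and mapped to a fail-safe speed rather than to minimum speed. The threshold and fail-safe speed shown are assumptions; actual behavior depends on how the drive is configured.

```python
# Assumed threshold and fail-safe speed for illustration; actual values are
# drive-specific and must be coordinated with the VFD configuration.
SIGNAL_LOSS_THRESHOLD_V = 1.0
FAILSAFE_SPEED_PCT = 100.0   # e.g., secondary CHW pumps failing to full speed

def speed_from_signal(volts, min_speed_pct=20.0, max_speed_pct=100.0):
    """Interpret a 2-10V speed input: readings below the threshold are treated
    as a loss of signal rather than as a command for minimum speed."""
    if volts < SIGNAL_LOSS_THRESHOLD_V:
        return FAILSAFE_SPEED_PCT                      # loss of signal -> fail-safe speed
    frac = min(max((volts - 2.0) / 8.0, 0.0), 1.0)     # map 2-10V to 0-1
    return min_speed_pct + frac * (max_speed_pct - min_speed_pct)

print(speed_from_signal(6.0))   # 60.0 (normal command)
print(speed_from_signal(0.3))   # 100.0 (signal lost)
```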
In the authors’ experience, FAIL LAST is a risky choice,
particularly for VFD speed because the VFD’s internal
controls do not always recognize the loss of signal quickly
enough. If the speed signal goes to zero before the VFD
recognizes the loss of communication then the VFD may
think minimum speed is the intended operating point.
Furthermore, commissioning FAIL LAST can lead to a
false sense of security because it may not be possible
to simulate all possible failure scenarios in commis-
sioning. For some devices it is possible to come up with
a safe fail-safe state and speed that is acceptable in all
conditions. For example, secondary chilled water pump
speed might normally be modulated to maintain chilled
water ΔP at setpoint. If the pumps failed ON and failed
to 100% speed, then the chilled water coil control valves
served by the pumps should be able to maintain control.
In some cases, there simply is not a fail-safe state or
speed that always works. Going back to the cooling tower
example, if a cooling tower VFD failed to 100% speed it
could cause ice to form on the tower in the winter.
ECMs
Fail-safe functionality for electronically commu-
tated motors varies based on the onboard controller pro-
vided with the motor and must be carefully coordinated
with vendors. FAIL ON and FAIL OFF functionality can
always be provided through proper relay configuration,
e.g., wire the run command through a normally closed
relay contact to FAIL ON upon controller failure. One
manufacturer the authors have used provides a fail-safe
speed response upon loss of an analog speed control
signal like that of a VFD but does not offer the fail to last-
known speed functionality provided with typical VFDs.
Valve and Damper Actuators
Actuators can be con-
figured to fail open or closed upon loss of command sig-
nal. Some servomotor actuators for large control valves
also allow for a preset fail-safe position upon loss of
control signal. Actuators can also be configured to hold
last-known position by using floating point control. In
effect, any desired command signal fail-safe response
can be achieved by specifying the appropriate actuator
and associated control wiring.
Actuator fail-safe positions must be coordinated with
fail-safe positions of related devices. For example, if
a tower VFD is configured to fail OFF when it loses its
speed signal, then the isolation valves serving that tower
should also be configured to fail closed upon loss of
command signal. Whether controlled by a central con-
troller, redundant central controllers, or a distributed
controller, every actuator should be configured for the possibility that it loses a command signal from its controller(s).
In addition to “loss of command signal,” actuators also need to be configured for “loss of control power.” Basically, the choice here is whether the actuator should have spring-return or not. If the actuator does not have spring-return, then it will fail in place on loss of control power. If it has spring-return, it can be configured to fail-open or fail-closed on loss of power. A final and less common option is to specify electronic fail-safe actuators that allow actuation to a fail-safe position between
full-open and full-closed upon loss of control power.
Both spring return and electronic fail-safe actuators
cost more than non-spring-return actuators and are
also a potential point of failure. For example, suppose
a tower isolation valve is spring-closed. If a tower is
in operation and loses power to the actuator then the
actuator will shut, taking that tower out of service,
even though its controller is still healthy. If the actua-
tor did not have spring-return then the tower would
have remained in service. In general, we recommend
avoiding spring return and electronic fail-safe actua-
tors in data center designs unless there is an “airtight”
reason to have them. A summary of the available combinations of loss-of-power and loss-of-signal fail-safe responses is provided in Table 1. For example, if you wanted a valve to fail open regardless of whether the control signal failed or the control power failed, then it should be a spring return actuator configured to be full open when the control signal is 0V and full closed when the signal is 10V (the first row and column of Table 1).
TABLE 1 Control valve fail-safe selection matrix. Each row is the desired response to a loss of control power; the entries give the actuator type and control signal configuration that also yield the desired response to a loss of control signal.
• Fail open on power loss: spring return actuator configured full closed at 10V (signal loss fails open), full closed at 0V (signal loss fails closed), or with floating point control (signal loss fails last); no listed combination provides a preset fail-safe position on signal loss.
• Fail closed on power loss: spring return actuator configured full closed at 10V (signal loss fails open), full closed at 0V (signal loss fails closed), or with floating point control (signal loss fails last); no listed combination provides a preset fail-safe position on signal loss.
• Fail last on power loss: non-spring-return actuator configured full closed at 10V (signal loss fails open), full closed at 0V (signal loss fails closed), or with floating point control (signal loss fails last); a servomotor actuator with onboard controller provides a preset fail-safe position on signal loss.
• Fail to a preset position on power loss: electronic fail-safe actuator configured full closed at 10V (signal loss fails open), full closed at 0V (signal loss fails closed), or with floating point control (signal loss fails last); no listed combination provides a preset fail-safe position on signal loss.
Air Handlers
Air-handling units (AHUs) should
have a dedicated local controller. Like any distributed
controller, an AHU controller must be configured
to recognize when it loses communication with the
central controller (or the central controller fails) and
must control its devices accordingly in such a scenario.
A binary network heartbeat can be used to monitor
the health of the central controller and communica-
tion path. The AHU controller must have a plan for
each piece of information that it normally receives
from the manager such as AHU enable/disable, sup-
ply fan speed setpoint, outside air temperature, and
economizer enable/disable. For example, in normal
operation a central controller may monitor several
underfloor pressure sensors and modulate all AHU fan
speeds together to maintain the lowest valid sensor at
setpoint. If the AHU controller loses the fan speed set-
point, then it can hold the last-known speed setpoint
or it can modulate its own fan speed if each AHU has
an underfloor pressure sensor wired to it.
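A minimal sketch of that decision follows, assuming a manager heartbeat timeout and hypothetical point names: the AHU controller follows the manager's speed command in normal operation and otherwise falls back to a local pressure loop, if one exists, or to the last known good command.

```python
# Assumed heartbeat timeout; point names are illustrative.
MANAGER_TIMEOUT_S = 60.0

def ahu_fan_speed(manager_speed_cmd_pct, seconds_since_manager_heartbeat,
                  last_good_speed_pct, local_pressure_loop_output_pct=None):
    """Choose the supply fan speed source for one AHU controller.

    Normal operation: follow the central manager's speed command.
    Manager lost: use a local underfloor pressure loop if a sensor is wired
    to this AHU; otherwise hold the last known good speed command."""
    if seconds_since_manager_heartbeat <= MANAGER_TIMEOUT_S:
        return manager_speed_cmd_pct
    if local_pressure_loop_output_pct is not None:
        return local_pressure_loop_output_pct
    return last_good_speed_pct

print(ahu_fan_speed(65.0, 5.0, 60.0))    # 65.0: manager healthy
print(ahu_fan_speed(65.0, 300.0, 60.0))  # 60.0: manager lost, hold last good speed
```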
Just like a distributed controller must know what
to do if the central controller fails, so too the central
controller must know what to do if any distributed con-
troller fails or loses communication. For example, if a
central plant controller loses communication with an
air handler, it does not know if the air handler is still
functioning or has shut down. If the plant sequences
include chilled water supply temperature setpoint
reset or ΔP setpoint reset requests from the AHU, then
the safest approach is probably to assume the AHU is
still functioning and needs the coldest water and high-
est pressure possible.
Chillers
Critical chiller points, such as start/stop, sta-
tus, and chilled water supply temperature (CHWST)
setpoint, should be hardwired points, rather than net-
work points. A hard-wired start/stop can be configured
to FAIL ON, FAIL OFF, or FAIL LAST on loss of external
command. FAIL ON may seem like the safest option but
suddenly bringing on chillers can trip chillers that are
already operating and make all chillers unstable, par-
ticularly if load is very low. FAIL LAST is typically the best
option but requires an additional controller digital out-
put and relay logic as illustrated in Figure 5.
FIGURE 5 Chiller hold last command relay logic.
Chillers can also typically be configured to maintain
a safe fail-safe CHWST setpoint on loss of external set-
point. Because chillers can be configured to hold last
known run command and default to a fail-safe CHWST
setpoint, dedicated DDC or PLC controllers per chiller
are not typically necessary, i.e., the chiller’s internal
controller is sufficiently fail-safe.
Sensors
While redundant controllers can add unwarranted
complexity, redundant sensors add little complexity
and are highly recommended for critical control inputs.
Critical sensors include those used for control of central-
ized control variables such as chilled water temperature,
chilled water differential pressure, underfloor plenum
pressure, and space pressure, as well as some sensors
used for distributed/local control functions such as air
handler supply air temperature (SAT). Many engineers
assume that because AHUs are redundant, their sensors do not need to be redundant, but a faulty SAT sensor
can be worse than a failed AHU because it can unknow-
ingly cause an AHU to provide unacceptably hot or cold
air to a critical IT space.
Most sensors are not critical sensors. An air handler
might have sensors for SAT, return air temperature, out-
side air temperature, mixed air temperature, damper
feedback, valve feedback, fan status, etc. The SAT is
“where the rubber hits the road” and is probably the
only sensor that deserves redundancy. Failure of other sensors does not carry the same risk. Failure of an OAT sensor, for example, might cost some extra chiller energy
by not economizing at the right time, but is unlikely to
prevent the AHU from achieving SAT setpoint. In fact,
having two SAT sensors is probably more important
than having any outside, return, or mixed air sensors
in an AHU. AHUs can share sensors with other AHUs or
do without them. For example, if the AHU uses direct
evaporative cooling without compressor cooling then
the only temperature sensor that the AHU needs is SAT.
See “Disparity Alarms” below for more discussion on
redundant sensors.
Sequences
Reliability can be improved by avoiding sequences
that rely on less reliable sensors, like humidity sensors
or water flow sensors. For example, suppose condenser
water pump speed needs to be high enough to maintain
at least the minimum flow rate required by the cool-
ing towers. A condenser water flow meter or ΔP sensors
across the chiller barrels could be used in the sequence.
However, in this case the most reliable option is to use
open loop control rather than closed loop control: dur-
ing commissioning have the balancer determine the
minimum pump speed for every combination of pumps,
towers and chillers and then include that minimum
speed in the control sequences. In general, the fewer
sensors that are required for normal operation, the
better.
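A minimal sketch of that open-loop approach follows; the minimum speeds and equipment combinations are placeholders standing in for the balancer's commissioning data.

```python
# Placeholder balancing data: minimum condenser water pump speed (%) keyed by
# (pumps on, towers on, chillers on), determined by the balancer during
# commissioning. These numbers are illustrative, not real balancing results.
MIN_CWP_SPEED_PCT = {
    (1, 1, 1): 45.0,
    (1, 2, 1): 55.0,
    (2, 2, 1): 35.0,
    (2, 2, 2): 50.0,
}

def min_cw_pump_speed(pumps_on, towers_on, chillers_on):
    """Open-loop minimum pump speed: no flow meter or DP sensor required."""
    # Default to full speed for any combination not covered by the balancer.
    return MIN_CWP_SPEED_PCT.get((pumps_on, towers_on, chillers_on), 100.0)

print(min_cw_pump_speed(2, 2, 1))  # 35.0
```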
Sequences should also make liberal use of “belts and
suspenders” to mitigate the risk of sensor error. A good
example is chiller staging. Adjusting the number of
enabled chillers based on measured chiller load is a good
way to stage chillers to maximize efficiency, but it should
not be the only way to stage chillers because the flow or
temperature sensors used to measure the load can fail
or go out of calibration. Therefore, chillers should also
be staged up on inability to maintain the chilled water setpoint, excessive chilled water pressure drop, and high chiller percent
rated load amps.
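The sketch below captures that belts-and-suspenders idea: stage up on measured load, but also on symptoms that the load measurement may be missing. All thresholds are placeholders, and each trigger would normally carry its own time delay.

```python
def stage_up_reasons(measured_load_tons, stage_capacity_tons,
                     chwst_f, chwst_setpoint_f,
                     chw_dp_psi, chw_dp_design_psi,
                     max_chiller_pct_rla):
    """Return the list of reasons (if any) to stage up a chiller.
    Thresholds are placeholders; each trigger would also carry a time delay."""
    reasons = []
    if measured_load_tons > 0.9 * stage_capacity_tons:
        reasons.append("measured load near stage capacity")
    if chwst_f > chwst_setpoint_f + 2.0:
        reasons.append("CHWST too far above setpoint")
    if chw_dp_psi > 1.1 * chw_dp_design_psi:
        reasons.append("excessive chilled water pressure drop")
    if max_chiller_pct_rla > 95.0:
        reasons.append("high chiller percent rated load amps")
    return reasons

# Load sensing says 800 of 1,000 tons, but CHWST is drifting above setpoint.
print(stage_up_reasons(800.0, 1000.0, 47.5, 44.0, 12.0, 15.0, 92.0))
# ['CHWST too far above setpoint']
```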
Perhaps the most critical function of a data center
sequence of operations is recognizing and respond-
ing to failures. For example, when do you consider
a chiller to be in alarm and therefore shut it off and
start another chiller? A chiller plant will likely have
redundant chillers and may not be near design load
so it makes sense to be conservative and consider a
chiller in alarm not only when the chiller sends an
alarm signal, but also when the chiller status is off
unexpectedly, when the leaving chilled water tem-
perature is too far above setpoint, etc. It gets trickier,
however, if there are no other chillers left to start that
are not already in alarm. In this extreme case, the
sequence should probably enable all chillers until
there are enough alarm-free chillers to meet the cur-
rent chiller stage. Of course, all scenarios should be
individually analyzed and solutions will be specific
to both the mechanical and electrical design for the
plant.
Sequences must also cover another critical concern:
restoring mechanical equipment as quickly as possible
when a site loses utility power and switches to generator
power.1
Sequences must also anticipate sensor and device fail-
ures. If a SAT sensor fails and there is a redundant one,
then it can be used. But what if the redundant sensor
also fails? Then should the chilled water valve or econo-
mizer dampers stay in last position, go to a fail-safe posi-
tion, or should the AHU shut down?
The risk of device failures can be mitigated by pro-
viding for rolling redundancy wherever possible in
sequences. Going back to the chiller staging example,
in a four chiller plant, load staging points for switch-
ing from two to three chillers and three to four chillers
would be chosen to maximize efficiency. The one to two
chiller staging point would however be set at the low-
est load that would allow two chillers to operate stably
without cycling. This approach mitigates the risk of
losing the load during a single chiller failure. A similar
strategy should be employed for both condenser water
and chilled water pump staging to minimize the risk of
chiller trips.
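A minimal sketch of such rolling-redundancy staging thresholds for a four-chiller plant follows; the tonnage values are placeholders chosen only to show the intent.

```python
# Placeholder staging thresholds (tons) for a four-chiller plant. The 1-to-2
# point is set at the lowest load at which two chillers run stably, not at the
# efficiency optimum, so a spare chiller is already online well before it is
# needed; the higher stages are chosen for efficiency.
STAGE_UP_LOAD_TONS = {2: 250.0, 3: 1600.0, 4: 2400.0}

def target_chiller_stage(load_tons, current_stage):
    stage = current_stage
    while stage < 4 and load_tons >= STAGE_UP_LOAD_TONS[stage + 1]:
        stage += 1
    return stage

print(target_chiller_stage(300.0, 1))  # 2: second chiller enabled early for rolling redundancy
```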
Alarms
Alarms are critical not only for alerting operators of
failures but also for identifying degradations in perfor-
mance and potential problems as early as possible.
Alarms should be tuned just like PID loops. Alarms
should trigger just outside of normal operation. For
example, a high SAT alarm should not be set at 2°F
(1.1°C) above SAT setpoint if SAT fluctuates within ±3°F (±1.7°C) of
setpoint in normal operation. Of course, setting alarms
too loosely can cost operators (and alarm response
sequences) precious time in responding to critical situ-
ations. Some tuning may be possible during pre-occu-
pancy commissioning but often it is not possible to prop-
erly tune alarms until post-occupancy commissioning.
Nuisance alarms are a common problem in some
data centers. When there are hundreds or thousands
of alarms every day the operators have little choice but
to assume that most or all alarms are nuisance alarms.
Some techniques to avoid nuisance alarms, besides
alarm tuning, are:
• Levels: All alarms should be classified into at least
three to five levels, such as fatal, critical, warning, notifi-
cation, maintenance reminder, etc.
• Entry Delays: All alarms should have an adjustable delay time such that the alarm is not triggered unless the alarm condition is true for the delay time (see the sketch following this list).
• Exit Deadband: All alarms on analog inputs should
have an adjustable deadband or hysteresis—e.g., if the
SAT alarm is set at 85°F (29°C) for 5 minutes then the
alarm does not restore to normal until the SAT drops be-
low the alarm setpoint minus a deadband of 3°F (1.7°C)
for 5 minutes.
• Suppression Period: A particular instance of an
alarm should be prohibited from recurring for a defined
suppression period. For example, if communication
from a controller is fading in and out every few minutes,
then one alarm a day is sufficient rather than one alarm
every few minutes.
• Hierarchical and Maintenance Mode Suppression:
An alarm should be suppressed if it is associated with
a device that has been taken offline for maintenance,
or if its hierarchical alarm(s) is active—e.g., CRAH VFD
failure may be listed as the hierarchical alarm for CRAH
SAT alarm such that if the CRAH VFD failure is active
then the CRAH SAT alarm is suppressed. An alarm may
have several hierarchical alarms that suppress it. If an
alarm’s hierarchical alarm is suppressed then the alarm is
suppressed—e.g., loss of power would suppress CRAH VFD
failure, which would in turn suppress CRAH SAT alarm.
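The sketch below combines an entry delay, an exit deadband, and a suppression period for a single analog high-limit alarm; the limits and timing values are placeholders, and hierarchical and maintenance-mode suppression would layer on top of this.

```python
import time

class HighLimitAlarm:
    """One analog high-limit alarm with an entry delay, an exit deadband,
    and a suppression period. All default values are placeholders."""

    def __init__(self, limit, entry_delay_s=300.0, deadband=3.0,
                 suppression_s=86400.0):
        self.limit = limit
        self.entry_delay_s = entry_delay_s
        self.deadband = deadband
        self.suppression_s = suppression_s
        self.active = False
        self._above_since = None
        self._last_annunciated = None

    def update(self, value, now=None):
        """Call on every scan; returns True when the alarm should be annunciated."""
        now = time.monotonic() if now is None else now
        if not self.active:
            if value > self.limit:
                if self._above_since is None:
                    self._above_since = now
                if now - self._above_since >= self.entry_delay_s:   # entry delay
                    self.active = True
                    if (self._last_annunciated is None or
                            now - self._last_annunciated >= self.suppression_s):
                        self._last_annunciated = now                # suppression period
                        return True
            else:
                self._above_since = None
        elif value < self.limit - self.deadband:                    # exit deadband
            self.active = False
            self._above_since = None
        return False

# Example: 85°F SAT alarm; the condition must hold for 5 minutes before alarming.
sat_alarm = HighLimitAlarm(limit=85.0)
print(sat_alarm.update(88.0, now=0.0))     # False: entry delay not yet satisfied
print(sat_alarm.update(88.0, now=400.0))   # True: above limit for more than 300 s
print(sat_alarm.update(83.5, now=500.0))   # False: stays active until below 82°F
```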
Disparity Alarms
Disparity alarms alert operators to possible sensor
drift by comparing two or more measured or calculated
values that should closely agree. They are particularly
useful for sensors that are difficult to calibrate and keep
calibrated such as humidity sensors and air or water
flow sensors. Obviously redundant SAT sensors should
TECHNICAL FEATURE
A S H R A E J O U R N A L a s h r a e . o r g O CT O B E R 2 0 1 82 2
always read about the same. Other disparity alarms may
be triggered by comparing:
• Chiller temperatures, flows, and pressure drops to
other chillers and to the plant total readings;
• Btu meters on both sides of waterside economizer
heat exchangers;
• Supply air dew point of active air handlers, when the air handlers are not adding or removing humidity.
When three or more sensors feed into disparity alarm
logic, e.g., for site outside air temperature sensors, it is
possible to use disparity alarm logic to disqualify clearly
inaccurate sensors from control functions until released
by operators.
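A minimal sketch of that voting logic follows, assuming a fixed disparity tolerance: sensors that deviate from the group median by more than the tolerance are flagged and excluded from the control value until released.

```python
import statistics

def check_disparity(readings, tolerance=2.0, disqualified=frozenset()):
    """Compare redundant sensors against their median. readings is a dict of
    sensor name -> value; tolerance is a placeholder. Returns the value to use
    for control and the set of sensors that should raise a disparity alarm
    (and be disqualified until released by an operator)."""
    usable = {k: v for k, v in readings.items() if k not in disqualified}
    group_median = statistics.median(usable.values())
    suspect = {k for k, v in usable.items() if abs(v - group_median) > tolerance}
    good = [v for k, v in usable.items() if k not in suspect]
    control_value = statistics.median(good) if good else group_median
    return control_value, suspect

value, suspect = check_disparity({"OAT-1": 68.25, "OAT-2": 68.75, "OAT-3": 75.5})
print(value, suspect)   # 68.5 {'OAT-3'}
```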
Commissioning
Probably the two most critical commissioning tasks are
functional testing before occupancy and trend reviews
after occupancy. Functional testing should include sim-
ulating every possible failure scenario in the most real-
istic manner possible and confirming that the controls
react accordingly. Overriding the status feedback from a
fan, for example, is not as realistic as pulling the discon-
nect when the fan is running. Functional testing should
also include capacity testing of equipment to ensure
devices and systems are able to perform at maximum
and minimum design capacity.
While failure scenarios can typically be simulated rea-
sonably well, it is often not possible to accurately simu-
late how smoothly the mechanical and controls design
will perform in normal operation when subjected to real
loads. Functional testing can confirm that sequences
are programmed correctly, but it typically cannot con-
firm that loops are tuned or identify flawed sequences
in need of minor or major revisions. Therefore, trends
must be reviewed after occupancy to confirm that all
systems are in fact operating per sequences, that all
control loops are properly tuned to avoid instability/
hunting, and that the alarms, sequences, and setpoints
are in fact the best ones possible. Often minor adjust-
ments to sequences and setpoints are needed to improve
stability and reliability and mitigate nuisance alarms.
Usually multiple rounds of trend reviews are required
to validate sequences that depend on different load and
weather conditions. Trend review includes reviewing
alarm logs to mitigate nuisance alarms as described
above.
If any significant changes are to be made to the con-
trols after occupancy, the revised sequences should
first be tested on a simulator before being installed
and functionally tested on the live system. Simulator
testing generally involves uploading the updated con-
trol program onto a controller in the contractor’s shop
and verifying that the expected controller outputs
are triggered in response to overridden sensor input
values.
Over the life of a data center, the load changes, the
mechanical equipment ages, sequences get tweaked,
setpoints get changed, nuisance alarms get disabled,
etc. Therefore, both functional testing and trend reviews
should be performed periodically throughout the life of
a data center to ensure the controls and equipment are
still sufficiently reliable at all times. For example, if the
controls design includes fully redundant DDC central
controllers, then it is important to regularly fail the lead
controller to verify the backup controller picks up where
the lead controller left off and runs the exact same pro-
gram as the lead controller.
Conclusion
Data center controls are inherently more complex
than the individual mechanical components they
serve. As such, careful design and commissioning of
controls is essential to ensure data center fault respon-
siveness. Fault responsive control system architecture
with well thought out device configuration is critical to
ensure proper controls response following the failure
of any control device. Smart sequences of operation
realize the redundancy afforded by well-designed
network architecture and device configuration.
Specification of redundant sensors for critical applica-
tions safeguards against one failed sensor torpedoing
the redundancy provided by smart sequences. Properly
specified and tuned alarms provide operators feed-
back they can rely on to identify either a real failure
or degraded system performance that will ultimately
cause one. Lastly, a thorough commissioning process
emphasizing functional testing and post-occupancy
trend review ensures that fault responsiveness
afforded by all aspects of the controls system design is
realized.
References
1. Hydeman, M., R. Seidl, C. Shalley. 2005. “Staying on-line: data center commissioning.” ASHRAE Journal (4).