White Paper
Nondisruptive Operations and
Clustered Data ONTAP Jay Bounds, Arun Raman, NetApp
March 2015 | WP-7195
2 Nondisruptive Operations and Clustered Data ONTAP White Paper
TABLE OF CONTENTS
1 Introduction ........................................................................................................................................... 3
2 Nondisruptive Operations (NDO) and Clustered Data ONTAP ........................................................ 3
2.1 Planned Events ...............................................................................................................................................4
2.2 Unplanned Events ...........................................................................................................................................7
3 Protocols and NDO ............................................................................................................................... 8
3.1 Windows File Services (WFS) .........................................................................................................................8
3.2 Network File Services (NFS) ...........................................................................................................................8
3.3 Storage Area Network (SAN) ..........................................................................................................................8
3.4 Metro Cluster (MCC) .......................................................................................................................................8
4 Cost Savings from NDO ....................................................................................................................... 9
4.1 Opportunity for Reinvestment (Opex Savings) ................................................................................................9
4.2 Real Direct Savings (Capex Savings) .............................................................................................................9
4.3 Cost Avoidance Savings .................................................................................................................................9
4.4 Case Study .....................................................................................................................................................9
5 Conclusion .......................................................................................................................................... 11
LIST OF TABLES
Table 1) Technology Refresh use cases and benefits. ...................................................................................................4
Table 2) Maintenance operation use cases and benefits. ..............................................................................................6
Table 3) Protection against unplanned events. ..............................................................................................................7
Table 4) NDO support matrix for different operations. ....................................................................................................8
Table 5) Annual Savings .............................................................................................................................................. 10
LIST OF FIGURES
Figure 1) NDO and clustered Data ONTAP. ...................................................................................................................3
3 Nondisruptive Operations and Clustered Data ONTAP White Paper
1 Introduction
As IT operations become more and more critical to business, no downtime can be tolerated. Downtime
causes lost business, poor customer satisfaction, and competitive weakness. Today’s “always-on” data
centers are all about how the underlying infrastructure provides continuous data availability to the
applications despite changing business needs. From a storage perspective, this means a flexible storage
infrastructure, which allows businesses to start small and grow nondisruptively as the business grows.
This flexibility reduces up-front costs and enables businesses to build their data infrastructure with no
disruption to users or applications.
2 Nondisruptive Operations (NDO) and Clustered Data ONTAP
NDO is not merely a feature or an attribute of clustered NetApp® Data ONTAP
®; it’s an architectural vision
for eliminating all forms of downtime from storage environments. Figure 1 classifies operations and
events in the data center that typically incur downtime. This paper describes how clustered Data ONTAP
can end downtime resulting from these events.
Figure 1) NDO and clustered Data ONTAP
Lifecycle operations are the operations that a customer performs to optimize the storage environment to
meet business SLAs while maintaining a cost-optimized solution. These operations include moving
datasets around the cluster to different tiers of storage and storage controllers to optimize the
performance level of the dataset and to manage capacity allocations for the future growth of the dataset.
Maintenance operations are operations during which components of the storage subsystem can be
maintained or upgraded without incurring any data outage; for example, replacing a single hardware
component such as a disk drive or shelf fan; or even replacing a complete controller, shelf, or system.
The idea is that although data potentially lives forever, hardware does not, so maintenance and
replacement of hardware will happen one or more times over the lifetime of a dataset.
Infrastructure resiliency is the base building block for the storage subsystem; it prevents a customer
from having an unplanned outage when a hardware or software failure occurs. Infrastructure resiliency is
made up of redundant FRU components, multipath HA controller configurations, RAID, and NetApp
WAFL® (Write Anywhere File Layout) software enhancements that help prevent failures from a software
perspective. For node hardware failures or software failures, HA failover allows a node in an HA pair to
fail over to its partner.
4 Nondisruptive Operations and Clustered Data ONTAP White Paper
At a high level, there are two types of downtime – unplanned and planned. Unplanned downtime is
usually the result of component failure in the system, site failure, or human error; planned downtime is an
activity that can be scheduled or planned.
2.1 Planned Events
As critical as it is to protect against unplanned downtime, the most common form of downtime is the
planned variety. By some estimates, 80% to 90% of all downtime in data centers is planned. Planned
downtime falls into two categories – maintenance operations (capacity and performance balancing) and
lifecycle operations (technology refresh).
Capacity and Performance Balancing
Because of the difficulty of predicting future capacity and performance usage of applications and
changing business needs, it is quite common for capacity or performance hot spots to emerge in a
storage cluster. Usually, the organization’s response to this unpredictability is to significantly
overprovision storage, leading to wasteful underutilization of storage resources. Even when data
migration is necessary, it usually requires application-level downtime. Given the frequency of capacity
and performance bottleneck events (often weekly events in many customer environments), NetApp
internal research suggests that nearly 40% of all benefits of NDO result from improving utilization from
capacity or performance rebalancing.
Technology Refresh
Most hardware assets, including storage arrays, are on a 3-to-4-year depreciation cycle and are replaced
by the latest gear available. This process usually involves copying existing data and migrating existing
applications to the new array. Tech refresh is a top-of-mind storage initiative for administrators for multiple
years running (source: The InfoPro Group). According to the same source, nearly 70% of all data
migrations are done for tech refresh.
Table 1) Technology Refresh use cases and benefits
Use Case Benefits
Capacity and Performance Balancing
Clustered Data ONTAP, with its nondisruptive data migration capabilities, such as LIF migrate , NetApp DataMotion™ for Volumes (VolMove), and DataMotion™ for Luns (LunMove), allows customers to remediate capacity or performance bottlenecks by moving them to another part of the cluster that has excess capacity or performance, thus avoiding application outage and reconfiguration and also leading to an overall dramatically higher level of storage utilization.
Learn more…
Saves administrators’ time by eliminating the need for
Negotiating downtime with application administrators
Shutting down multiple apps
Restarting multiple apps
Verifying that everything is working correctly
Accelerates the use of new hardware sooner: Tech refreshes are done in hours compared to weeks
5 Nondisruptive Operations and Clustered Data ONTAP White Paper
Storage Tiering
Data storage volumes and luns are a dynamic resource for workloads. The required priority of any given volume or lun is a function of the workload’s SLA requirements and the capacity growth expectations. DataMotion for Volumes and DataMotion for Luns provides transparent movement of volumes and/or luns to alternate resources within the cluster to accommodate the resource needs of the workload on the volume.
Learn more…
or months.
Scale-Out
DataMotion for Volumes allows customers to scale a storage system as needed by adding either new nodes or additional disk shelves and relocating data volumes to new hardware to balance provisioned resource pools on demand.
Hardware Tech Refresh
Clustered Data ONTAP offers two solutions for doing tech refresh:
Aggregate relocation (ARL) capability makes it possible to replace NetApp storage controllers nondisruptively without the need to copy data.
Clustered Data ONTAP nondisruptive DataMotion for Volumes can be used to migrate existing data to new arrays.
Learn more…
Maintenance Operations
Maintenance operations are typically done by using the built-in functionality of the storage array. They
include operations such as software and firmware upgrades, as well as replacing broken components
nondisruptively.
Software and Firmware Upgrades
Usually, each storage array has multiple types of firmware, such as disk firmware and shelf firmware. As
new software revisions are released, the ability to upgrade all of these nondisruptively is essential.
Clustered Data ONTAP has the ability to upgrade Data ONTAP software as well as the disk and shelf
firmware nondisruptively.
Data ONTAP 8.3 introduces the Automated Nondisruptive Upgrade (ANDU) method which simplifies the
Nondisruptive Upgrade(NDU) process by performing ANDU within the 8.3 release family (minor NDU
6 Nondisruptive Operations and Clustered Data ONTAP White Paper
only). The new cluster image update command installs the target Data ONTAP image on each node
,validates the cluster components and nondisruptively executes upgrade.
Break, Fix, and Replace Operations
Typically, when a component fails, a redundant component takes over to maintain high availability.
However, at this point the storage array is running in degraded mode, leaving it susceptible to a
secondary failure leading to outage. Therefore the ability to replace broken components nondisruptively is
critical to maintaining redundancy and full availability. Clustered Data ONTAP makes it possible to
nondisruptively replace storage controllers, subcomponents inside the controllers (such as HBAs, NICs,
motherboard, NVRAM, and so on), switches, cables, and disks.
Table 2) Maintenance operation use cases and benefits
Use Case Benefits
Software Upgrade
HA architecture offers easier software upgrade without the need for downtime
Disk firmware, disk shelf firmware, and so on can be replaced nondisruptively.
Learn more…
Eliminate deferred software and firmware upgrades
Be on the latest software faster and start using the latest features sooner
Saves administrators’ time by eliminating the need for
Negotiating downtime with application administrators
Shutting down multiple apps
Restarting multiple apps
Verifying that everything is working correctly
Eliminate deferred maintenance and spend less time in degraded mode
Eliminate deferred maintenance and apply changes immediately
Replace failed component
A failed component is taken over by a redundant component to maintain high availability.
Failed components can be replaced nondisruptively, maintaining redundancy and full availability.
Change configuration
Additional components can be added or upgraded to improve performance or add functionality. For example, NetApp Flash Cache™ can be added to a controller nondisruptively without the need for any application downtime.
7 Nondisruptive Operations and Clustered Data ONTAP White Paper
Nondisruptive shelf removal allows old disk shelves to be retired without the need to bring down the system.
2.2 Unplanned Events
Historically, most storage (and even server or application) systems were designed to protect against
unplanned downtime by eliminating all single points of failure and building redundancy at all levels.
NetApp storage systems are no exception – from redundant hardware components inside the storage
controller to a dual controller HA architecture. Clustered Data ONTAP provides proven five 9’s availability,
with less than 5 minutes of downtime per year.
Table 3) Protection against unplanned events
Resiliency Benefits
Storage Failures
High availability (HA) storage controller pairs, multipath HA , and alternate control path (ACP) eliminate any single points of failure, from host to data path.
NetApp RAID-DP®, ACP,
persistent NVRAM write logs, service processor protection, and dual power supplies provide hardware availability. Learn more…
Greater than five 9s availability
No single point of failure
Disk Failures
NetApp RAID-DP is an advanced RAID 6 technology. At a cost equal to or less than that of RAID 5 and with excellent performance, RAID-DP technology offers double-parity protection against data loss. In other words, RAID-DP enables “peace of mind” enterprise storage without compromise.
Learn more…
Path failures
Clustered Data ONTAP implements industry-standard Asymmetric Logical Unit Access (ALUA) to manage multiple I/O from host to storage and to handle path failures.
Learn more…
8 Nondisruptive Operations and Clustered Data ONTAP White Paper
3 Protocols and NDO
Nondisruptive operations are perceived differently by the client and by the application. Most protocols
have built-in ability to handle some disruptions caused by data migration and switch-over, and to hide
disruption from applications and users. Clustered Data ONTAP supports all common protocols (NFS,
CIFS, FC, FCoE, and iSCSI), and enhancements are made at both the protocol layer and the Data
ONTAP layer to make it nondisruptive.
3.1 Windows File Services (WFS)
SMB 3.0 for Microsoft® Hyper-V
® and SQL Server
® environments with continuously available file shares
support complete nondisruptive operations for all scenarios, including LIF migration, volume move, and
storage failover and giveback. However due to SMB 1.0 and 2.0 protocol shortcomings, the client
connections get dropped during LIF migration or storage failover, but they are nondisruptive for volume
move operations. SMB 2.0 introduces durable file handles, which allow an application to use the existing
open file handles after a LIF migration. In summary, the minimum requirement for NDO is SMB 2.0 or
later. Learn more…
3.2 Network File Services (NFS)
NFS clients normally retry to connect to the storage. If the timeouts are configured correctly for the
protocol and at the application level, NFS clients preserve nondisruptive operations.
3.3 Storage Area Network (SAN)
Clustered Data ONTAP asymmetric active-active architecture is based on the ALUA standard. ALUA
enables host multipath software to distinguish between optimized and unoptimized paths and to switch
over to unoptimized active paths when the current active path is down due to failure or controller
maintenance.
Table 4) NDO support matrix for different operations
Protocol LIF Migration Volume Move LUN Move Failover and
Giveback
Aggregate Relocation
(ARL)
WFS * ** **
NFS
SAN
* Only SMB 2.0 and later are nondisruptive. ** Only SMB 3.0 with continually available file shares is nondisruptive.
3.4 MetroCluster (MCC)
MetroCluster extends the nondisruptive operations of clustered Data ONTAP, providing protocol-
transparent site-level recovery. An HA Cluster at each site can service independent workloads and
provide bidirectional switchover capability in the event of a planned operation or site-wide disaster. The
result is zero RPO and near zero RTO for the hosted applications.
MetroCluster provides both automatic local failover (HA) capability as well as simple, one-command site-
wide switchover. Automatic synchronous propagates all data and configuration changes to both sites.
Applications, hosts, and clients can integrate easily into MetroCluster and there is little to no ongoing
administrative effort required. Other clustered ONTAP features such as storage efficiency, Quality of
Service (QoS) and SnapMirror/SnapVault data protection are seamlessly integrated into MetroCluster.
Monitoring and alerting is provided through OnCommand Unified Manager.
9 Nondisruptive Operations and Clustered Data ONTAP White Paper
4 Cost Savings from NDO
Cost savings related to downtime have typically focused heavily on eliminating unplanned downtime – by
demonstrating savings related to avoiding lost employee productivity, lost revenue from unserved
customers, loss of customers to competitors, damaged reputation, and so on. Although these savings are
real and quantifiable, the impact of eliminating planned downtime is far more substantial: Based on our
research, 70% to 80% of NDO benefits are achieved by enabling the elimination of planned downtime.
These cost savings from eliminating planned downtime with clustered ONTAP fall into three categories:
opex savings from improving administrator productivity, capex savings from increased utilization, and
reduced risk from proactive maintenance and upgrades.
4.1 Opportunity for Reinvestment (Opex Savings)
For planned downtime, the argument could be made that it can be scheduled at a relatively low-activity
time of the day, week, or quarter, thus minimizing the impact to the organization and the end user.
However, in a shared storage environment, coordinating these low-activity periods among multiple
stakeholders is a difficult task. The actual outage requires effort from multiple administrators to shut down,
verify, start up, verify, and so on. It may not even be possible to negotiate a suitable window for the
downtime.
4.2 Real Direct Savings (Capex Savings)
There are two sources of capex savings. First, nondisruptive data migration to remediate application
capacity and performance bottlenecks allows customers to dramatically increase the overall utilization of
their storage systems.
The second source of capex savings is tech refresh. Without NDO and clustered Data ONTAP, newly
purchased storage frequently sits boxed up, waiting for a window for downtime to install the new
equipment, or for disruptive data migrations to be completed, while the clock has already started ticking
on monthly maintenance charges and depreciation.
4.3 Cost Avoidance Savings
The third source of NDO savings from eliminating planned downtime is reduction in risk. Being able to
perform maintenance operations such as software and firmware upgrades, as well as replacing broken
components proactively, significantly reduces the risk of unplanned downtime resulting from a secondary
failure. Without NDO, these upgrades are often delayed for lack of a downtime maintenance window,
making the storage system vulnerable to issues with known fixes. Similarly, the lack of maintenance
windows forces customers to continue to run in degraded mode resulting from the failure of the primary
component.
Our research suggests that more than 20% of all savings from NDO are achieved by reducing the risk
that results from deferred maintenance.
4.4 Case Study
This case study examines the IT department of a large technology company. The company has 145
NetApp HA installations supported by 21 storage administrators and 21 application administrators. With
145 solutions, at least one maintenance activity is performed on an HA pair every week. Storage
administrators must constantly negotiate with application administrators to get a common maintenance
window for any planned activity, thus bringing down their productivity.
10 Nondisruptive Operations and Clustered Data ONTAP White Paper
Although the company’s infrastructure met most of their business needs, the IT team wanted to adopt
new technology that would serve the business needs much faster. The company’s IT department focused
on building a cost-competitive solution that meets business needs and also improves productivity of the
administrators and increases operations margins.
The IT department chose clustered Data ONTAP and immediately saw their administrators spending less
time in negotiating for downtime. Although the maintenance operations continued to be performed during
times of less activity, the hours of negotiation and the need to stop, start, and verify applications were
completely eliminated. Table 5 shows the annual savings that the company realized in terms of
administrator hours, capex expenditures, and risk avoidance.
11 Nondisruptive Operations and Clustered Data ONTAP White Paper
Table 5) Annual savings
Savings ($) Savings (%)
Opportunity for Reinvestment
Clustered Data ONTAP allows maintenance and refresh activities to take place while the storage and its applications are active, without disrupting users.
For storage and application administrators, this can eliminate the need to negotiate for downtime, and all the activities associated with shutting down and restarting applications are not required.
$1,631,291
(26,100 hours)
30%
Real Direct Savings
With clustered Data ONTAP, because new controllers can be introduced into a cluster nondisruptively — either in place of existing controllers (using aggregate relocation) or to expand and rebalance the existing cluster — the newly purchased hardware assets can be put into service as soon as they are delivered. On average this adds 6 months to the productive life of a storage asset.
Additionally, each new generation of storage arrays brings better performance and increased density, allowing multiple workloads to be consolidated on fewer controllers, resulting in space, power, and cooling savings.
$1,524,234 37%
Cost Avoidance Savings
Storage administrators used clustered Data ONTAP HA architecture to perform nondisruptive software and hardware upgrades and also to perform break, fix, and replace operations
$657,720 31%
Assumptions:
2 hours spent by storage and application administrators on negotiating downtime
3 hours spent by application administrators for stopping, starting, and verifying applications
Software and hardware upgraded once a quarter
Applications rebalanced twice a year on average
Average annual salary of an IT staff person = $125,000
Average life of storage controllers is 4 years
Average of 10 applications run per HA pair
5 Conclusion
NetApp’s vision of NDO goes far beyond eliminating unplanned downtime or even common forms of
planned downtime such as maintenance operations. It seeks to eliminate even the more common forms
of downtime that result from lifecycle operations such as technology refresh and rebalancing for capacity
and performance. Eliminating planned downtime reduces opex by eliminating the need for negotiating
downtime windows, shutting down applications before a maintenance window, and restarting them
afterward. Additionally, being able to carry out lifecycle operations nondisruptively helps reduce capex
through improved utilization, better consolidation, and extending the useful life of new storage assets.
Finally, being able to carry out maintenance operations proactively helps reduce the risk of unplanned
downtime caused by secondary failure.
12 Nondisruptive Operations and Clustered Data ONTAP White Paper
NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer’s responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.
© 2014 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo, Go further, faster, are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. WP-7195-0114
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.