White Paper Nondisruptive Operations and …SMB 3.0 for Microsoft® Hyper-V® and SQL Server® environments with continuously available file shares support complete nondisruptive operations

White Paper

Nondisruptive Operations and

Clustered Data ONTAP Jay Bounds, Arun Raman, NetApp

March 2015 | WP-7195

2 Nondisruptive Operations and Clustered Data ONTAP White Paper

TABLE OF CONTENTS

1 Introduction ........................................................................................................................................... 3

2 Nondisruptive Operations (NDO) and Clustered Data ONTAP ........................................................ 3

2.1 Planned Events ...............................................................................................................................................4

2.2 Unplanned Events ...........................................................................................................................................7

3 Protocols and NDO ............................................................................................................................... 8

3.1 Windows File Services (WFS) .........................................................................................................................8

3.2 Network File Services (NFS) ...........................................................................................................................8

3.3 Storage Area Network (SAN) ..........................................................................................................................8

3.4 Metro Cluster (MCC) .......................................................................................................................................8

4 Cost Savings from NDO ....................................................................................................................... 9

4.1 Opportunity for Reinvestment (Opex Savings) ................................................................................................9

4.2 Real Direct Savings (Capex Savings) .............................................................................................................9

4.3 Cost Avoidance Savings .................................................................................................................................9

4.4 Case Study .....................................................................................................................................................9

5 Conclusion .......................................................................................................................................... 11

LIST OF TABLES

Table 1) Technology Refresh use cases and benefits. ...................................................................................................4

Table 2) Maintenance operation use cases and benefits. ..............................................................................................6

Table 3) Protection against unplanned events. ..............................................................................................................7

Table 4) NDO support matrix for different operations. ....................................................................................................8

Table 5) Annual Savings .............................................................................................................................................. 10

LIST OF FIGURES

Figure 1) NDO and clustered Data ONTAP. ...................................................................................................................3


1 Introduction

As IT operations become more and more critical to business, no downtime can be tolerated. Downtime

causes lost business, poor customer satisfaction, and competitive weakness. Today’s “always-on” data

centers are all about how the underlying infrastructure provides continuous data availability to the

applications despite changing business needs. From a storage perspective, this means a flexible storage

infrastructure, which allows businesses to start small and grow nondisruptively as the business grows.

This flexibility reduces up-front costs and enables businesses to build their data infrastructure with no

disruption to users or applications.

2 Nondisruptive Operations (NDO) and Clustered Data ONTAP

NDO is not merely a feature or an attribute of clustered NetApp® Data ONTAP

®; it’s an architectural vision

for eliminating all forms of downtime from storage environments. Figure 1 classifies operations and

events in the data center that typically incur downtime. This paper describes how clustered Data ONTAP

can end downtime resulting from these events.

Figure 1) NDO and clustered Data ONTAP

Lifecycle operations are the operations that a customer performs to optimize the storage environment to

meet business SLAs while maintaining a cost-optimized solution. These operations include moving

datasets around the cluster to different tiers of storage and storage controllers to optimize the

performance level of the dataset and to manage capacity allocations for the future growth of the dataset.

Maintenance operations are operations during which components of the storage subsystem can be

maintained or upgraded without incurring any data outage; for example, replacing a single hardware

component such as a disk drive or shelf fan; or even replacing a complete controller, shelf, or system.

The idea is that although data potentially lives forever, hardware does not, so maintenance and

replacement of hardware will happen one or more times over the lifetime of a dataset.

Infrastructure resiliency is the base building block for the storage subsystem; it prevents a customer

from having an unplanned outage when a hardware or software failure occurs. Infrastructure resiliency is

made up of redundant FRU components, multipath HA controller configurations, RAID, and NetApp

WAFL® (Write Anywhere File Layout) software enhancements that help prevent failures from a software

perspective. For node hardware failures or software failures, HA failover allows a node in an HA pair to

fail over to its partner.


At a high level, there are two types of downtime – unplanned and planned. Unplanned downtime is

usually the result of component failure in the system, site failure, or human error; planned downtime is an

activity that can be scheduled or planned.

2.1 Planned Events

As critical as it is to protect against unplanned downtime, the most common form of downtime is the

planned variety. By some estimates, 80% to 90% of all downtime in data centers is planned. Planned

downtime falls into two categories – maintenance operations (capacity and performance balancing) and

lifecycle operations (technology refresh).

Capacity and Performance Balancing

Because of the difficulty of predicting future capacity and performance usage of applications and

changing business needs, it is quite common for capacity or performance hot spots to emerge in a

storage cluster. Usually, the organization’s response to this unpredictability is to significantly

overprovision storage, leading to wasteful underutilization of storage resources. Even when data

migration is necessary, it usually requires application-level downtime. Given the frequency of capacity

and performance bottleneck events (often weekly events in many customer environments), NetApp

internal research suggests that nearly 40% of all benefits of NDO result from improving utilization from

capacity or performance rebalancing.

Technology Refresh

Most hardware assets, including storage arrays, are on a 3-to-4-year depreciation cycle and are replaced

by the latest gear available. This process usually involves copying existing data and migrating existing

applications to the new array. Tech refresh is a top-of-mind storage initiative for administrators for multiple

years running (source: The InfoPro Group). According to the same source, nearly 70% of all data

migrations are done for tech refresh.

Table 1) Technology Refresh use cases and benefits

Use Case Benefits

Capacity and Performance Balancing

Clustered Data ONTAP, with its nondisruptive data migration capabilities, such as LIF migrate , NetApp DataMotion™ for Volumes (VolMove), and DataMotion™ for Luns (LunMove), allows customers to remediate capacity or performance bottlenecks by moving them to another part of the cluster that has excess capacity or performance, thus avoiding application outage and reconfiguration and also leading to an overall dramatically higher level of storage utilization.

Learn more…

Saves administrators’ time by eliminating the need for

Negotiating downtime with application administrators

Shutting down multiple apps

Restarting multiple apps

Verifying that everything is working correctly

Accelerates the use of new hardware sooner: Tech refreshes are done in hours compared to weeks

https://library.netapp.com/ecmdocs/ECMP1196907/html/GUID-6545ED4C-726D-4A72-BE7B-6F9BB1746138.html


Storage Tiering

Data storage volumes and luns are a dynamic resource for workloads. The required priority of any given volume or lun is a function of the workload’s SLA requirements and the capacity growth expectations. DataMotion for Volumes and DataMotion for Luns provides transparent movement of volumes and/or luns to alternate resources within the cluster to accommodate the resource needs of the workload on the volume.

Learn more…

or months.

Scale-Out

DataMotion for Volumes allows customers to scale a storage system as needed by adding either new nodes or additional disk shelves and relocating data volumes to new hardware to balance provisioned resource pools on demand.

Hardware Tech Refresh

Clustered Data ONTAP offers two solutions for doing tech refresh:

Aggregate relocation (ARL) capability makes it possible to replace NetApp storage controllers nondisruptively without the need to copy data.

Clustered Data ONTAP nondisruptive DataMotion for Volumes can be used to migrate existing data to new arrays.

Learn more…

Maintenance Operations

Maintenance operations are typically done by using the built-in functionality of the storage array. They

include operations such as software and firmware upgrades, as well as replacing broken components

nondisruptively.

Software and Firmware Upgrades

Usually, each storage array has multiple types of firmware, such as disk firmware and shelf firmware. As

new software revisions are released, the ability to upgrade all of these nondisruptively is essential.

Clustered Data ONTAP has the ability to upgrade Data ONTAP software as well as the disk and shelf

firmware nondisruptively.

Data ONTAP 8.3 introduces the Automated Nondisruptive Upgrade (ANDU) method which simplifies the

Nondisruptive Upgrade(NDU) process by performing ANDU within the 8.3 release family (minor NDU

https://library.netapp.com/ecmdocs/ECMP1196995/html/GUID-A03C1B1E-3DE2-455D-B5AD-89C1389EF0C8.html

https://library.netapp.com/ecmdocs/ECMP1196905/html/GUID-9B8FB634-7745-4F5C-A818-6B5621957C3D.html


only). The new cluster image update command installs the target Data ONTAP image on each node

,validates the cluster components and nondisruptively executes upgrade.

Break, Fix, and Replace Operations

Typically, when a component fails, a redundant component takes over to maintain high availability.

However, at this point the storage array is running in degraded mode, leaving it susceptible to a

secondary failure leading to outage. Therefore the ability to replace broken components nondisruptively is

critical to maintaining redundancy and full availability. Clustered Data ONTAP makes it possible to

nondisruptively replace storage controllers, subcomponents inside the controllers (such as HBAs, NICs,

motherboard, NVRAM, and so on), switches, cables, and disks.

Table 2) Maintenance operation use cases and benefits

Use Case Benefits

Software Upgrade

HA architecture offers easier software upgrade without the need for downtime

Disk firmware, disk shelf firmware, and so on can be replaced nondisruptively.

Learn more…

Eliminate deferred software and firmware upgrades

Be on the latest software faster and start using the latest features sooner

Saves administrators’ time by eliminating the need for

Negotiating downtime with application administrators

Shutting down multiple apps

Restarting multiple apps

Verifying that everything is working correctly

Eliminate deferred maintenance and spend less time in degraded mode

Eliminate deferred maintenance and apply changes immediately

Replace failed component

A failed component is taken over by a redundant component to maintain high availability.

Failed components can be replaced nondisruptively, maintaining redundancy and full availability.

Change configuration

Additional components can be added or upgraded to improve performance or add functionality. For example, NetApp Flash Cache™ can be added to a controller nondisruptively without the need for any application downtime.

https://library.netapp.com/ecmdocs/ECMP1368708/html/frameset.html


Nondisruptive shelf removal allows old disk shelves to be retired without the need to bring down the system.

2.2 Unplanned Events

Historically, most storage (and even server or application) systems were designed to protect against

unplanned downtime by eliminating all single points of failure and building redundancy at all levels.

NetApp storage systems are no exception – from redundant hardware components inside the storage

controller to a dual controller HA architecture. Clustered Data ONTAP provides proven five 9’s availability,

with less than 5 minutes of downtime per year.

Table 3) Protection against unplanned events

Resiliency Benefits

Storage Failures

High availability (HA) storage controller pairs, multipath HA , and alternate control path (ACP) eliminate any single points of failure, from host to data path.

NetApp RAID-DP®, ACP,

persistent NVRAM write logs, service processor protection, and dual power supplies provide hardware availability. Learn more…

Greater than five 9s availability

No single point of failure

Disk Failures

NetApp RAID-DP is an advanced RAID 6 technology. At a cost equal to or less than that of RAID 5 and with excellent performance, RAID-DP technology offers double-parity protection against data loss. In other words, RAID-DP enables “peace of mind” enterprise storage without compromise.

Learn more…

Path failures

Clustered Data ONTAP implements industry-standard Asymmetric Logical Unit Access (ALUA) to manage multiple I/O from host to storage and to handle path failures.

Learn more…

https://library.netapp.com/ecm/ecm_download_file/ECMP1367947

http://www.netapp.com/us/system/pdf-reader.aspx?pdfuri=tcm:10-60325-16&m=tr-3298.pdf




3 Protocols and NDO

Nondisruptive operations are perceived differently by the client and by the application. Most protocols

have built-in ability to handle some disruptions caused by data migration and switch-over, and to hide

disruption from applications and users. Clustered Data ONTAP supports all common protocols (NFS,

CIFS, FC, FCoE, and iSCSI), and enhancements are made at both the protocol layer and the Data

ONTAP layer to make it nondisruptive.

3.1 Windows File Services (WFS)

SMB 3.0 for Microsoft® Hyper-V

® and SQL Server

® environments with continuously available file shares

support complete nondisruptive operations for all scenarios, including LIF migration, volume move, and

storage failover and giveback. However due to SMB 1.0 and 2.0 protocol shortcomings, the client

connections get dropped during LIF migration or storage failover, but they are nondisruptive for volume

move operations. SMB 2.0 introduces durable file handles, which allow an application to use the existing

open file handles after a LIF migration. In summary, the minimum requirement for NDO is SMB 2.0 or

later. Learn more…

3.2 Network File Services (NFS)

NFS clients normally retry to connect to the storage. If the timeouts are configured correctly for the

protocol and at the application level, NFS clients preserve nondisruptive operations.

3.3 Storage Area Network (SAN)

Clustered Data ONTAP asymmetric active-active architecture is based on the ALUA standard. ALUA

enables host multipath software to distinguish between optimized and unoptimized paths and to switch

over to unoptimized active paths when the current active path is down due to failure or controller

maintenance.

Table 4) NDO support matrix for different operations

Protocol LIF Migration Volume Move LUN Move Failover and

Giveback

Aggregate Relocation

(ARL)

WFS * ** **

NFS

SAN

* Only SMB 2.0 and later are nondisruptive. ** Only SMB 3.0 with continually available file shares is nondisruptive.

3.4 MetroCluster (MCC)

MetroCluster extends the nondisruptive operations of clustered Data ONTAP, providing protocol-

transparent site-level recovery. An HA Cluster at each site can service independent workloads and

provide bidirectional switchover capability in the event of a planned operation or site-wide disaster. The

result is zero RPO and near zero RTO for the hosted applications.

MetroCluster provides both automatic local failover (HA) capability as well as simple, one-command site-

wide switchover. Automatic synchronous propagates all data and configuration changes to both sites.

Applications, hosts, and clients can integrate easily into MetroCluster and there is little to no ongoing

administrative effort required. Other clustered ONTAP features such as storage efficiency, Quality of

Service (QoS) and SnapMirror/SnapVault data protection are seamlessly integrated into MetroCluster.

Monitoring and alerting is provided through OnCommand Unified Manager.



4 Cost Savings from NDO

Cost savings related to downtime have typically focused heavily on eliminating unplanned downtime – by

demonstrating savings related to avoiding lost employee productivity, lost revenue from unserved

customers, loss of customers to competitors, damaged reputation, and so on. Although these savings are

real and quantifiable, the impact of eliminating planned downtime is far more substantial: Based on our

research, 70% to 80% of NDO benefits are achieved by enabling the elimination of planned downtime.

These cost savings from eliminating planned downtime with clustered ONTAP fall into three categories:

opex savings from improving administrator productivity, capex savings from increased utilization, and

reduced risk from proactive maintenance and upgrades.

4.1 Opportunity for Reinvestment (Opex Savings)

For planned downtime, the argument could be made that it can be scheduled at a relatively low-activity

time of the day, week, or quarter, thus minimizing the impact to the organization and the end user.

However, in a shared storage environment, coordinating these low-activity periods among multiple

stakeholders is a difficult task. The actual outage requires effort from multiple administrators to shut down,

verify, start up, verify, and so on. It may not even be possible to negotiate a suitable window for the

downtime.

4.2 Real Direct Savings (Capex Savings)

There are two sources of capex savings. First, nondisruptive data migration to remediate application

capacity and performance bottlenecks allows customers to dramatically increase the overall utilization of

their storage systems.

The second source of capex savings is tech refresh. Without NDO and clustered Data ONTAP, newly

purchased storage frequently sits boxed up, waiting for a window for downtime to install the new

equipment, or for disruptive data migrations to be completed, while the clock has already started ticking

on monthly maintenance charges and depreciation.

4.3 Cost Avoidance Savings

The third source of NDO savings from eliminating planned downtime is reduction in risk. Being able to

perform maintenance operations such as software and firmware upgrades, as well as replacing broken

components proactively, significantly reduces the risk of unplanned downtime resulting from a secondary

failure. Without NDO, these upgrades are often delayed for lack of a downtime maintenance window,

making the storage system vulnerable to issues with known fixes. Similarly, the lack of maintenance

windows forces customers to continue to run in degraded mode resulting from the failure of the primary

component.

Our research suggests that more than 20% of all savings from NDO are achieved by reducing the risk

that results from deferred maintenance.

4.4 Case Study

This case study examines the IT department of a large technology company. The company has 145

NetApp HA installations supported by 21 storage administrators and 21 application administrators. With

145 solutions, at least one maintenance activity is performed on an HA pair every week. Storage

administrators must constantly negotiate with application administrators to get a common maintenance

window for any planned activity, thus bringing down their productivity.


Although the company’s infrastructure met most of their business needs, the IT team wanted to adopt

new technology that would serve the business needs much faster. The company’s IT department focused

on building a cost-competitive solution that meets business needs and also improves productivity of the

administrators and increases operations margins.

The IT department chose clustered Data ONTAP and immediately saw their administrators spending less

time in negotiating for downtime. Although the maintenance operations continued to be performed during

times of less activity, the hours of negotiation and the need to stop, start, and verify applications were

completely eliminated. Table 5 shows the annual savings that the company realized in terms of

administrator hours, capex expenditures, and risk avoidance.


Table 5) Annual savings

Savings ($) Savings (%)

Opportunity for Reinvestment

Clustered Data ONTAP allows maintenance and refresh activities to take place while the storage and its applications are active, without disrupting users.

For storage and application administrators, this can eliminate the need to negotiate for downtime, and all the activities associated with shutting down and restarting applications are not required.

$1,631,291

(26,100 hours)

30%

Real Direct Savings

With clustered Data ONTAP, because new controllers can be introduced into a cluster nondisruptively — either in place of existing controllers (using aggregate relocation) or to expand and rebalance the existing cluster — the newly purchased hardware assets can be put into service as soon as they are delivered. On average this adds 6 months to the productive life of a storage asset.

Additionally, each new generation of storage arrays brings better performance and increased density, allowing multiple workloads to be consolidated on fewer controllers, resulting in space, power, and cooling savings.

$1,524,234 37%

Cost Avoidance Savings

Storage administrators used clustered Data ONTAP HA architecture to perform nondisruptive software and hardware upgrades and also to perform break, fix, and replace operations

$657,720 31%

Assumptions:

2 hours spent by storage and application administrators on negotiating downtime

3 hours spent by application administrators for stopping, starting, and verifying applications

Software and hardware upgraded once a quarter

Applications rebalanced twice a year on average

Average annual salary of an IT staff person = $125,000

Average life of storage controllers is 4 years

Average of 10 applications run per HA pair

5 Conclusion

NetApp’s vision of NDO goes far beyond eliminating unplanned downtime or even common forms of

planned downtime such as maintenance operations. It seeks to eliminate even the more common forms

of downtime that result from lifecycle operations such as technology refresh and rebalancing for capacity

and performance. Eliminating planned downtime reduces opex by eliminating the need for negotiating

downtime windows, shutting down applications before a maintenance window, and restarting them

afterward. Additionally, being able to carry out lifecycle operations nondisruptively helps reduce capex

through improved utilization, better consolidation, and extending the useful life of new storage assets.

Finally, being able to carry out maintenance operations proactively helps reduce the risk of unplanned

downtime caused by secondary failure.


NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer’s responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.

© 2014 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo, Go further, faster, are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such. WP-7195-0114

Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.

http://support.netapp.com/matrix/mtx/login.do

Documents

White Paper Nondisruptive Operations and …SMB 3.0 for Microsoft® Hyper-V® and SQL Server® environments with continuously available file shares support complete nondisruptive operations