Best Practices in DR Management and Testing


Best Practices for Planning and Managing Disaster Recovery Testing

16 August 2011 | ID: G00215785

John P Morency

Annual costs for disaster recovery testing can be as high as $150,000. Solutions for discovering and mapping software and data dependencies among Web-based applications are likely to become essential for DR testing/exercising, as part of an organization's best practices.

Overview

The time and resource costs of disaster recovery (DR) plan exercising, especially when it is supported by manual or semimanual processes, have become the most significant IT DR management (IT-DRM) pain point for many of Gartner's clients. Specific steps can be taken and technologies can be deployed to reduce recovery plan testing costs and complexities.

Key Findings

The annual costs of DR testing can reach or exceed $150,000 for many Gartner clients.

These costs could go even higher, as new business applications are rolled into production.

Tools capable of discovering and mapping software and data dependencies between Web-based applications are likely to become essential for managing efficient and effective recovery testing/exercising.

The need for more thorough business application inquiry and transaction testing will drive enterprises to assess organizational and test management consolidation and integration to more efficiently scale recovery testing in the future.

Recommendations

Evaluate IT service dependency mapping technologies from vendors such as BMC Software (Tideway), CA Technologies, HP, IBM, Neebula, ServiceNow and VMware to assess the extent to which they can simplify the testing process and make it more reliable.

Pilot software change management tools (from vendors such as BMC Software, CA Technologies, HP, IBM Maximo, SAP and ServiceNow) and procedures that have the potential to most effectively synchronize change implementation between primary production and secondary recovery data centers.

Evaluate the possible savings that can be gained by consolidating the application testing resources, processes and tools used by the DR and quality assurance (QA) testing teams.


Analysis

DR testing is critical for supporting business resiliency. However, as the scope of mission-critical business processes, applications and data increases, sustaining the quality and thoroughness of the test process can be a challenge. Gartner client recovery and continuity-specific inquiries indicate that many enterprises are now implementing new approaches for managing recovery exercising, mostly because of the increasing cost and logistical complexity of traditional approaches.

Gartner research shows the importance of effectively managing recovery exercising costs. In one study of the exercising costs of federal government agencies (see "Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"), clients reported that IT-DRM annual exercise budget allocations ranged from $20,000 to more than $150,000, depending on the size, location, number of participants, scope of exercise and organizational structure of the governmental unit. Results from nongovernment client inquiries have shown that it isn't unusual for the annual cost of DR exercising to be between $75,000 and $150,000.

Gartner has identified some of the key reasons enterprises find DR testing increasingly difficult and/or costly:

Increasingly complex dependencies: Web applications and services often have logically meshed relationships with, and dependencies on, other applications and data, some of which are often part of a lower recovery tier (see Table 1).

Inconsistencies: These occur between the current state of the data center infrastructure, applications and data, and their state at the time of the last recovery test. This may affect the extent to which production applications and data can be successfully recovered, unless robust change and configuration management processes (and tools) are in place. For example, a monthly volume of even a few hundred changes to a data center's OS, middleware, applications or management agents can result in a difference of thousands of changes between the current production configuration and the production configuration at the time of the last recovery test.

Lack of resources: With the increasingly complex scope of testing, enterprises rarely have adequate recovery testing resources to exercise all production application inquiries and transactions on a regular basis. Some organizations test only their most mission-critical applications. Others rotate testing among applications, while still others focus on systems that have failed previous tests. A frequent result is that lower-priority applications are tested far less frequently, and their recoverability is qualified as being "… on a best effort basis."
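The arithmetic behind the inconsistency example can be made explicit. The following is a minimal sketch; the figures of 300 changes per month and six months since the last test are illustrative assumptions, not values reported by Gartner.

```python
def accumulated_changes(changes_per_month: int, months_since_last_test: int) -> int:
    """Estimate how many configuration changes have accumulated since the last recovery test."""
    return changes_per_month * months_since_last_test


# Hypothetical inputs: a few hundred changes per month, tested twice a year.
drift = accumulated_changes(changes_per_month=300, months_since_last_test=6)
print(f"Changes since the last recovery test: {drift}")
```

Even at this modest change rate, a semiannual test cycle leaves well over a thousand untested changes between exercises, which is why the change and configuration management processes matter.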

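Table 1. Recovery Tiers

Tier  Service Levels
1     24/7 scheduled; 99.9% availability (less than 45 minutes per month); recovery time objective (RTO) = two to eight hours; recovery point objective (RPO) = four hours
2     24/6¾ scheduled; 99.5% availability (less than 3.5 hours per month); RTO = eight to 24 hours; RPO = four hours
3     18/7 scheduled; 99% availability (less than 5.5 hours per month); RTO = one to three days; RPO = one day
4     24/6½ scheduled; 98% availability (less than 13.5 hours per month); RTO = more than three days; RPO = one day

Source: Gartner (August 2011)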
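The availability percentages in Table 1 translate into monthly downtime budgets once the scheduled service window is known. The sketch below shows one way to derive them. The 30-day month is an assumption made for illustration, and the function name is hypothetical.

```python
def monthly_downtime_budget_hours(availability_pct, hours_per_day, days_per_week, days_in_month=30):
    """Allowed downtime per month, in hours, for a scheduled service window.

    Assumes a 30-day month by default (an illustrative assumption).
    """
    scheduled_hours = hours_per_day * days_per_week * (days_in_month / 7)
    return scheduled_hours * (1 - availability_pct / 100)


tier1_minutes = monthly_downtime_budget_hours(99.9, 24, 7) * 60  # 24/7 schedule
tier3_hours = monthly_downtime_budget_hours(99.0, 18, 7)         # 18/7 schedule
print(f"Tier 1: about {tier1_minutes:.1f} minutes per month")
print(f"Tier 3: about {tier3_hours:.1f} hours per month")
```

Under the 30-day assumption, Tier 1 works out to about 43.2 minutes and Tier 3 to about 5.4 hours, consistent with the "less than 45 minutes" and "less than 5.5 hours" limits in the table.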

In light of these challenges, Gartner is increasingly seeing clients rethink their test strategies and implement a series of best practices.

Establishing a Minimum Acceptable Level of Recovery Testing

The 2011 Gartner Risk Management Survey shows that enterprises test recoverability, on average, once or twice a year. However, anecdotal evidence (based on more than 3,000 DR-related Gartner client inquiries in a three-year period) suggests that fewer and fewer of these live tests involve all production applications and data. Instead, tests are specific to an individual recovery tier (typically, the recovery tier corresponding to the most mission-critical applications) or include an affinity group of production applications that have related software and data dependencies. This means that many organizations follow the 80/20 rule: 80% of the testing is done on the applications that are the most mission-critical (which are often 20% or less of the total number of production applications).

Despite this data, however, you shouldn't completely ignore test procedures for less critical applications and data. Rather, IT must ensure the recovery of the business processes and supporting applications, the loss of which would cause the greatest loss of revenue, productivity or organizational reputation.

In terms of how often an organization should conduct testing, we offer the following baselines, again subject to your organization's special circumstances:

Conduct live testing for Tier 1 and Tier 2 applications and data at least twice per year.

Initiate more frequent (monthly, quarterly) manual or (ideally) automated testing on application affinity groups.

Perform failover and failback testing during the same or separate planned downtime periods.

Ensure that the required data restoration and application activation cycle times meet or beat the RTO and RPO targets.
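The last baseline, meeting or beating the RTO and RPO targets, lends itself to an automated check after each exercise. The following is a minimal sketch. The class, the application names and the measured values are all hypothetical and serve only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    """Outcome of one recovery exercise for one application (hypothetical structure)."""
    application: str
    rto_hours: float                 # target: maximum time to restore service
    rpo_hours: float                 # target: maximum tolerated data loss window
    measured_recovery_hours: float   # observed restore and activation time
    measured_data_loss_hours: float  # observed age of the recovered data

    def meets_targets(self) -> bool:
        return (self.measured_recovery_hours <= self.rto_hours
                and self.measured_data_loss_hours <= self.rpo_hours)


# Illustrative results from a single test cycle.
results = [
    RecoveryResult("order-entry", 8, 4, 6.5, 1.0),
    RecoveryResult("billing", 8, 4, 9.0, 2.0),
]
failed = [r.application for r in results if not r.meets_targets()]
print("Applications that missed their targets:", failed)
```

Recording results in a structure like this makes it easier to track whether cycle times improve or regress from one exercise to the next.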

Regardless of how you determine recovery tier definitions, it is important to begin thinking about how you can best test recoverability, especially for the most mission-critical application data. Test more frequently the related applications and data that support a smaller set of key business processes, and shift the testing focus to how IT can best meet or beat the associated recovery targets.

Pain Point Remediation Alternatives

Automated Dependency Mapping

The challenge of ensuring that all required software and data dependencies are addressed in a recovery configuration will become more complex, as new business applications that have been purchased, created by in-house development teams, or acquired through merger and acquisition (M&A) activity are turned over to production.

Increasingly mature IT service dependency mapping tools can help. These products, available from vendors such as BMC Software, CA Technologies, HP and IBM, enable IT organizations to discover, document and track relationships by mapping dependencies among the infrastructure components, such as servers, networks, storage and applications, that form an IT service (see "IT Service Dependency Mapping Tools: Market Dynamics Update"). These tools are used primarily for applications, servers and databases; however, a few discover network devices (such as switches and routers), mainframe-unique attributes and virtual infrastructures, thereby presenting a complete service map. Although these tools are often bought in conjunction with configuration management database (CMDB) projects, we have seen a significant increase in their acquisition and use for data center-specific projects, such as IT-DRM modernization and data center consolidation.
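Once a dependency map exists, one useful question for DR planning is which dependencies of a mission-critical application sit in a slower recovery tier. The sketch below walks a dependency graph to answer it. It is an illustration only and does not represent how any of the named vendor products work; the application names, the graph and the tier assignments are hypothetical.

```python
from collections import deque


def lower_tier_dependencies(app, depends_on, tier):
    """Return the transitive dependencies of `app` that belong to a slower recovery tier.

    A higher tier number means a slower recovery commitment.
    """
    seen = set()
    queue = deque(depends_on.get(app, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(depends_on.get(dep, []))
    return sorted(d for d in seen if tier[d] > tier[app])


# Hypothetical dependency map and tier assignments.
depends_on = {
    "web-store": ["catalog-db", "pricing-service"],
    "pricing-service": ["tax-tables"],
}
tier = {"web-store": 1, "catalog-db": 1, "pricing-service": 2, "tax-tables": 3}

at_risk = lower_tier_dependencies("web-store", depends_on, tier)
print("Dependencies recovered later than web-store:", at_risk)
```

Here the Tier 1 application depends, directly or indirectly, on components in Tiers 2 and 3, so it cannot be fully usable within its own RTO unless those components are recovered earlier or moved to a higher tier.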

Data dependency mapping products from 21st Century Software, AppAssure, Bocada, Continuity Software, InMage and Sanovi provide automated data, metadata and index consistency assurance between production files and databases and their replicas that are maintained at one or more recovery sites. Background software agents determine and report on the likelihood of achieving specified recovery targets, based on analyzing and correlating data from applications, databases, clusters, OSs, virtual systems, networking and storage replication mechanisms. These products perform their consistency checking on data located on direct-attached storage (DAS), storage-area-network (SAN)-connected storage or network-attached storage (NAS) at the primary production and secondary recovery data centers.
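The basic idea of consistency checking between production data and its replica can be sketched with content digests. This is a simplified illustration under stated assumptions, not a description of how the products named above operate; the file names and contents are hypothetical, and real tools also correlate replication state, metadata and indexes.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest of a blob, used here as a simple content fingerprint."""
    return hashlib.sha256(data).hexdigest()


def inconsistent_items(production: dict, replica: dict) -> list:
    """Names whose replica copy is missing or differs from the production copy."""
    return sorted(
        name
        for name, content in production.items()
        if name not in replica or digest(replica[name]) != digest(content)
    )


# Hypothetical snapshots of production data and its recovery-site replica.
production = {"orders.db": b"version-2", "orders.idx": b"index-a", "audit.log": b"entries"}
replica = {"orders.db": b"version-1", "orders.idx": b"index-a"}

stale = inconsistent_items(production, replica)
print("Items that would not recover consistently:", stale)
```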

Synchronizing Distributed Change

Ensuring 100% change consistency between the production data center configuration, applications and data and their recovery data center counterparts is a challenging task. At a minimum, the recovery infrastructure at the secondary site must be dedicated, although this may not be the case for the recovery facility itself.

Typically, asynchronous data replication (either host- or storage controller-based) and server virtualization are used to support a partial or full development and testing configuration that is used by in-house application development, support and testing teams during normal production hours. In this scenario, synchronizing changes between the primary production configuration and the development and test configuration (which may also support recovery) is typically managed by the development and testing teams, in conjunction with operations support. This may involve the automated replication of updated production virtual server images to the secondary configuration, in parallel or in tandem with production data replication.

Several product options support virtual server replication, including offerings from such vendors as Acronis, Asigra, Atempo, BakBone Software, CA Technologies, CommVault, Double-Take Software, EMC, FalconStor Software, HP, i365, IBM, InMage, Microsoft, NetApp, Novell, PHD Virtual, Quest Software, Symantec, Syncsort and Veeam. However, for recovery configurations that include a mix of physical and virtual servers, as well as a combination of shrink-wrapped and in-house-developed applications, the use of IT process automation tools that orchestrate infrastructure configuration, provisioning and change updating is likely to be required. (Further information on the current state of IT process automation, change and configuration management can be found in "Hype Cycle for IT Operations Management, 2011.")
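The change-consistency goal discussed in this section amounts to detecting where the recovery configuration has drifted from production. A minimal sketch of such a comparison follows. The configuration keys and version strings are hypothetical, and a real inventory would come from a CMDB or discovery tool rather than hand-written dictionaries.

```python
def config_drift(primary: dict, recovery: dict) -> dict:
    """Map each differing setting to its (primary, recovery) values.

    A value of None means the setting is absent at that site.
    """
    drift = {}
    for key in primary.keys() | recovery.keys():
        p, r = primary.get(key), recovery.get(key)
        if p != r:
            drift[key] = (p, r)
    return drift


# Hypothetical configuration inventories for the two data centers.
primary = {"os_patch": "2011-07", "app_server": "7.0.1", "monitoring_agent": "3.2"}
recovery = {"os_patch": "2011-05", "app_server": "7.0.1"}

for setting, (prod_value, dr_value) in sorted(config_drift(primary, recovery).items()):
    print(f"{setting}: production={prod_value}, recovery={dr_value}")
```

Running a comparison like this on a schedule, rather than only before a recovery test, shortens the window in which unsynchronized changes can accumulate.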

Consolidating Testing Personnel, Tools and Skill Sets

One approach that has met with some client success is consolidating what were previously separate QA and recovery testing teams into a single organization. Organizational consolidation, together with the consolidation and standardization of testing platforms and scripts, is an approach that can be used to support preproduction turnover regression testing, as well as ongoing DR testing.

Organizations that implemented this approach did so to address a lack of recovery testing breadth and depth. Given the increasing numbers of mission-critical applications requiring recovery, as well as the related numbers of inquiries and transactions, it became clear that manual or semimanual testing processes could only provide limited recovery assurance. This was because the extent to which a full set of production inquiries and transactions could be consistently exercised by the recovery exercising team was limited by testing time constraints.

In one specific instance, a recovery team was able to meet the required RTO and RPO targets for the most mission-critical applications, but the recovery of the production environment, as perceived by the business unit end users, was short-lived, because undiscovered (and, therefore, unaddressed) software and data dependencies resulted in several inquiries and transactions prematurely aborting or incurring unacceptably long response times. The net result was that the recovery team won the battle by supporting the required RTOs and RPOs, but lost the war, because the usability and effectiveness of the recovery operations configuration was limited.

A new approach was needed that could not only improve the breadth and depth of application testing coverage, but could increase the efficiency and effectiveness of recovery exercising as a whole. Following an assessment of the technical benefits and cost savings that could result from a merger of the internal QA and the DR testing teams, a decision was made to consolidate them into a single organization and to standardize the management and automation of test processes by leveraging many of the tools, scripts and staff resources that were already in place.

The benefits that have been realized by some of the early adopters of this approach include increasingly reliable and more effective test exercises, combined with more thorough testing of representative production inquiries and transactions against the recovery configuration. The latter improves the likelihood that recovery operations can be initiated within required RTO and RPO targets, and ensures more stable recovery operations.

Summary

IT-DRM managers may recognize one or more of these approaches as potentially adding value to their IT-DRM programs. Regardless of which side of the issue you see your organization leaning toward, it is important to consider the key technologies your organization uses, because, for many organizations, the use of more traditional recovery testing and technology that helps manage more sustained availability may not be so much a case of "either/or" in the next five years, but rather a case of "and."

Recommended Reading

Some documents may not be available as part of your current Gartner subscription.

"Hype Cycle for Business Continuity Management and IT Disaster Recovery Management, 2011"

"From Development to Production: Integrating Change, Configuration and Release"

"Predicts 2011: Improved Recoverability May Be on the Horizon, but Significant Challenges Remain"

"Data Center Conference Poll Findings: Disaster Recovery Testing Mistakes"

"Cost-Cutting IT: Should You Cut Back Your Disaster Recovery Exercise Spending?"

"Toolkit: Best Practices for a Successful Tabletop Recovery Test"

"Hype Cycle for IT Operations Management, 2011"

"IT Service Dependency Mapping Tools: Market Dynamics Update"

Strategic Planning Assumption

By the end of 2014, 15% of enterprises will have significantly reduced or eliminated traditional DR testing as a result of supporting more resilient IT operations.

© 2011 Gartner, Inc. and/or its Affiliates. All Rights Reserved. Reproduction and distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.

