
DISASTER RECOVERY

Architecting a Blueprint for Disaster Recovery

BY RICH ARMOUR, PAUL ENO, MICHAEL KIMBLE, AND JESSE FREUND

Effective risk assessment and business continuity provisions enable organizations to minimize downtime and recover crucial applications quickly when disaster strikes. By analyzing business processes and functions and defining the impact of system downtime in financial terms, enterprises can effectively prioritize the enterprise-wide recovery process to help reduce business consequences. Based on internally honed best practices, Dell’s top 10 rules for disaster planning can help enterprises align recovery efforts with overall business objectives.

www.dell.com/powersolutions Reprinted from Dell Power Solutions, February 2006. Copyright © 2006 Dell Inc. All rights reserved.

Rule 1: Articulate the need in financial terms

[…]ness decisions should drive busi[…]. The lack of an effective business […] recovery plan can expose organizations to financial risk. Aside from lost income, […] besmirch a good brand name. Plus, in regulated industries, such as […], […] the risk of legal liability.

[…] enterprises affix a monetary value to […] planning efforts to help ensure buy-in […] the organization. Additionally, enterprises […] financial progress as they move through […] the business continuity plan. After all, […] continuity planning do not arrive en masse […] organization flips the switch on its new […] recovery benefits are associated with […] reduce the lead time to recovery, and […] technologies and processes, this […]. In the end, by viewing the business continuity strategy in monetary terms and tracking financial progress over time, organizations can align business continuity efforts with smart business decisions.
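One simple way to put a monetary value on downtime, and on reducing the lead time to recovery, is to price an outage as its duration multiplied by the revenue lost per hour, plus any extra hourly cost of operating in a degraded state. The sketch below assumes that linear model. The function names and all figures are hypothetical placeholders, not Dell numbers, and real outage costs are rarely perfectly linear in time.

```python
def outage_cost(hours: float, revenue_per_hour: float,
                revenue_at_risk: float, extra_cost_per_hour: float = 0.0) -> float:
    """Estimated cost of a single outage (hypothetical linear model).

    hours               -- how long the affected process is down
    revenue_per_hour    -- revenue that normally flows through that process
    revenue_at_risk     -- fraction (0 to 1) of that revenue lost while it is down
    extra_cost_per_hour -- overtime, penalties, manual workarounds, and so on
    """
    if not 0.0 <= revenue_at_risk <= 1.0:
        raise ValueError("revenue_at_risk must be between 0 and 1")
    return hours * (revenue_per_hour * revenue_at_risk + extra_cost_per_hour)


def value_of_faster_recovery(before_hours: float, after_hours: float, **costs: float) -> float:
    """Cost avoided per incident by cutting recovery time from before_hours to after_hours."""
    return outage_cost(before_hours, **costs) - outage_cost(after_hours, **costs)


if __name__ == "__main__":
    # Placeholder figures for illustration only; substitute your own estimates.
    costs = {"revenue_per_hour": 50_000.0, "revenue_at_risk": 0.5, "extra_cost_per_hour": 2_000.0}
    print(outage_cost(48, **costs))                  # 48 * (25,000 + 2,000) = 1296000.0
    print(value_of_faster_recovery(48, 4, **costs))  # 44 * 27,000 = 1188000.0
```

Expressing the benefit as "cost avoided per incident" lets continuity spending be compared directly against the value of shorter recovery times.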

Rule 2: Use hard data to create a risk profile

Because Dell’s data centers are located in Texas, the Dell business continuity team might have assumed that the company’s primary risk would be a catastrophic tornado. In fact, after analyzing hard data, the team determined that the primary threat to its data centers is likely to be fire.

How did Dell construct its risk profile? The company invited its insurance carriers to bring hard data to the business continuity plan. Insurance carriers possess claims data that can tell organizations what the risks are in a given geographic area. Plus, insurance carriers naturally want to help clients reduce risk. In the end, the hard claims data from the insurance carriers offered some of the most valuable and surprising information as Dell developed its own business continuity plan.
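Annualized loss expectancy is one common way to turn claims data into a ranked risk profile. It multiplies the expected number of events per year by the expected loss per event. The sketch below assumes the carrier's data can be reduced to those two numbers for each hazard. The figures are invented placeholders, not Dell's or any insurer's data. What they illustrate is the point made in the text: a frequent, moderate hazard can outrank a rare, dramatic one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    annual_frequency: float  # expected events per year at the site, e.g. from claims history
    loss_per_event: float    # expected loss per event, in currency units


def annualized_loss(h: Hazard) -> float:
    """Annualized loss expectancy: expected events per year times loss per event."""
    return h.annual_frequency * h.loss_per_event


def rank_hazards(hazards: list[Hazard]) -> list[tuple[str, float]]:
    """(name, annualized loss) pairs, largest expected annual loss first."""
    ranked = sorted(hazards, key=annualized_loss, reverse=True)
    return [(h.name, annualized_loss(h)) for h in ranked]


if __name__ == "__main__":
    # Invented placeholder figures, NOT real claims data.
    site = [
        Hazard("tornado", annual_frequency=0.002, loss_per_event=40_000_000),
        Hazard("fire", annual_frequency=0.05, loss_per_event=5_000_000),
        Hazard("flood", annual_frequency=0.01, loss_per_event=6_000_000),
    ]
    for name, ale in rank_hazards(site):
        print(f"{name:8s} {ale:>12,.0f}")
```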

Rule 3: Identify the critical resources

Not all data is created equal, and the same holds true for applications. An effective business continuity planning process requires organizations to undertake a rigorous analysis of business processes and functions and to identify the critical resources that require redundancy, backup, and recovery. Before organizations can discuss the IT resources necessary to maintain business-critical processes, they must assess the business impact of losing systems, paying particular attention to interdependencies among systems. For example, a business-critical system may rely on input from another system that is not deemed critical in and of itself. It is crucial that the business side of the organization lead the discussion of critical processes before the IT side defines the technologies necessary to enable business continuity.

Dell has developed a three-tier strategy for classifying data and applications. Class 1 systems support business-critical processes. At Dell, for example, a business-critical process is any service that directly interacts with the customer, which includes taking and processing orders as well as building, shipping, and servicing products. Class 2 systems support business-essential processes, for which a 48-hour outage would begin to have a negative business impact. Class 3 systems enable business-support processes, for which a temporary loss of service is deemed noncritical. By classifying processes, applications, and data according to business criticality, Dell helps ensure that the appropriate investment is made to recover the most crucial systems first.
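Interdependencies matter because a system inherits criticality from the systems that rely on it. The sketch below assumes each system is tagged with its Class 1, 2, or 3 tier and with the systems it takes input from. It then repeatedly promotes every upstream system to the tier of its most critical dependent until nothing changes. The system names are hypothetical. This illustrates the idea; it is not Dell's classification tooling.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Lower value means more critical, mirroring Class 1/2/3 in the text."""
    CLASS_1 = 1  # business-critical
    CLASS_2 = 2  # business-essential
    CLASS_3 = 3  # business-support


def effective_tiers(tiers: dict[str, Tier],
                    depends_on: dict[str, list[str]]) -> dict[str, Tier]:
    """Raise each upstream system to the tier of its most critical dependent.

    depends_on maps a system to the systems it takes input from. A feed into a
    Class 1 system is recovered as Class 1, even if it is lower on its own.
    """
    effective = dict(tiers)  # copy; the input classification is left unchanged
    changed = True
    while changed:  # repeat until stable so chains of dependencies propagate fully
        changed = False
        for system, upstreams in depends_on.items():
            for upstream in upstreams:
                if effective[system] < effective[upstream]:
                    effective[upstream] = effective[system]
                    changed = True
    return effective


if __name__ == "__main__":
    # Hypothetical systems: order entry relies on a pricing feed, which relies on FX rates.
    tiers = {
        "order_entry": Tier.CLASS_1,
        "pricing_feed": Tier.CLASS_3,
        "currency_rates": Tier.CLASS_3,
        "expense_reports": Tier.CLASS_3,
    }
    depends_on = {"order_entry": ["pricing_feed"], "pricing_feed": ["currency_rates"]}
    for name, tier in sorted(effective_tiers(tiers, depends_on).items(), key=lambda kv: kv[1]):
        print(f"{name:16s} {tier.name}")
```

In the example, the pricing feed and the currency-rate service are both recovered as Class 1, while expense reporting stays Class 3.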

Rule 4: Think beyond the data center

Many disaster recovery efforts focus on keeping the data center up and running. However, effective business continuity planning must reach beyond applications, data, and infrastructure considerations. For example, it does no good to have the data center up and running if no provisions have been made to support the people performing vital business functions such as shipping and receiving. Of course, applications must be available, data must be accessible, and the network must be working. But by focusing on the data center at the expense of essential business processes and infrastructure components, enterprises risk turning robust data center functionality into little more than a paper tiger.

Rule 5: Eliminate or mitigate single points of failure

A single point of failure exists when there is no redundancy to compensate for a missing application, data, or infrastructure component. It may be an application or database server, a lone backup generator in a data center, or the long-haul network itself. Moreover, several single points of failure typically exist within an enterprise infrastructure. Consequently, organizations should perform a specific and detailed single-point-of-failure analysis across the entire infrastructure. Doing so may reveal that a key component was missed when a data center, or another form of disaster recovery system, was built. In the end, performing a single-point-of-failure analysis may help prevent an organization from having to reconstitute its business capabilities entirely when a relatively minor component fails.
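Once the infrastructure is described as a graph, part of a single-point-of-failure analysis can be automated. The sketch below models components as nodes and links as undirected edges. It removes each component in turn and uses a breadth-first search to check whether a source, such as a branch office, can still reach a destination, such as a database server. Any component whose loss cuts every path is a single point of failure. The topology and node names are hypothetical. This brute-force check is adequate for modest inventories. For very large graphs, articulation-point (cut-vertex) algorithms answer the closely related whole-graph question in linear time. Power, people, and suppliers, which the text also covers, would need their own models.

```python
from collections import deque


def build_graph(links):
    """Undirected adjacency sets from (a, b) link pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def connected(graph, src, dst, removed=None):
    """Breadth-first search from src to dst, skipping the removed component."""
    if removed in (src, dst):
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, src, dst):
    """Components whose individual loss cuts every path from src to dst."""
    graph = build_graph(links)
    if not connected(graph, src, dst):
        raise ValueError(f"{dst!r} is not reachable from {src!r} even with everything up")
    candidates = set(graph) - {src, dst}
    return sorted(n for n in candidates if not connected(graph, src, dst, removed=n))


if __name__ == "__main__":
    # Hypothetical topology: two WAN links, but everything funnels through one core switch.
    links = [
        ("branch_office", "wan_link_1"), ("branch_office", "wan_link_2"),
        ("wan_link_1", "core_switch"), ("wan_link_2", "core_switch"),
        ("core_switch", "db_server"),
    ]
    print(single_points_of_failure(links, "branch_office", "db_server"))  # ['core_switch']
```

The redundant WAN links hide a single core switch. That is the kind of component the text warns can be missed when a data center is built.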

Rule 6: Assume that everything is going to fail

Oftentimes, when enterprises build a disaster recovery plan, they do so with the expectation that land lines, cell phones, and the network will be available. Or they take for granted that the roads to the data center will be accessible, assuming the data center itself is still standing and employees are able to get there. The best-laid plans for business continuity account for the possibility that every key piece of internal and external infrastructure may fail or become unavailable for extended periods.

At Dell, the operating assumption is that every vital piece of infrastructure is capable of failing, and all of them may go down at the same time. Along these lines, the Dell recovery plan itself is stored on CD, and copies of the CD are distributed across multiple teams. At least one copy resides in each data center, and another copy is kept in the IT operations center. This way, if a disaster cripples phone service, Internet availability, or transportation infrastructure, Dell still has the capability to begin recovery.

TOP 10 RULES FOR DISASTER PLANNING

1. Articulate the need in financial terms.
2. Use hard data to create a risk profile.
3. Identify the critical resources.
4. Think beyond the data center.
5. Eliminate or mitigate single points of failure.
6. Assume that everything is going to fail.
7. Consider an active/active data center strategy.
8. Recognize potential vendor weaknesses.
9. Keep disaster recovery capability up-to-date.
10. Perform tests on a regular basis.

INTELLIGENT DATA PROTECTION: THE DELL POWERVAULT ML6000 TAPE LIBRARY

The Dell™ PowerVault™ ML6000 modular tape library is designed to intelligently protect an organization’s critical data. Through proactive diagnostics and flexible scalability, the PowerVault ML6000 enables organizations to prepare their storage environments for disaster recovery.

The built-in intelligence of the PowerVault ML6000 helps ensure that backups can execute as planned. The library’s diagnostics are designed to predict failures in the library’s environment, such as tape or drive malfunctions, and send e-mail messages to warn administrators of potential issues. This proactive notification allows administrators to plan ahead and resolve problem conditions before failures occur, thus limiting unplanned downtime. If a failure occurs, administrators can use one of the library’s simple troubleshooting wizards, which provide solutions to known issues ranging from cleaning or restarting drives to opening a service ticket. The wizards help administrators save time by resolving issues on-site. If administrators cannot resolve an issue using the wizards, detailed event logs and built-in relational diagnostics isolate failures at a subcomponent level, minimizing the time needed to repair the PowerVault ML6000.

The tape library’s modular and versatile scalability provides organizations with several capacity options, offering the flexibility to pay as they grow without limiting their ability to add more drives or slots to the existing library. The PowerVault ML6000 can scale from 14.4 TB of native backup storage capacity to 51.2 TB (native) using 9U expansion modules to support the demands of workgroup and mid-range data centers; up to 161 TB of capacity is planned for future releases. The library scales from two to six Ultrium 3 Linear Tape-Open (LTO-3) SCSI or Fibre Channel drives; future support is planned for up to 18 LTO-3 SCSI or Fibre Channel drives. In addition, from 36 to 128 cartridge slots give organizations added backup performance and capacity; support for up to 404 cartridge slots is planned for future releases. To satisfy ever-shrinking backup windows, the PowerVault ML6000 is designed to provide a maximum native transfer rate of 1.7 TB/hour.

The PowerVault ML6000 control module can be placed anywhere in the expansion stack so that organizations can easily expand and customize their libraries. Each additional expansion module leverages the existing robotics and intelligence of the control module to reduce the overall number of moving parts, thus enhancing the library’s reliability.

The tape library also offers a wide range of connectivity and compatibility options for protecting storage environments. The drive technology used in the PowerVault ML6000 can expand to support different connectivity standards, including SCSI and Fibre Channel. The library is also compatible with storage software from CommVault, EMC, and Symantec.

In addition, the PowerVault ML6000 is available with two Dell Services offerings. The Backup and Recovery Design and Deployment service provides a detailed backup and recovery plan that is designed to help organizations establish appropriate procedures to minimize or avoid data loss. Meanwhile, the Backup and Recovery Implementation service is designed to be a comprehensive approach for organizations implementing a backup and recovery process on a new Dell or Dell/EMC storage area network or network attached storage solution, and may include software from key third-party technology providers.

The drive and media technology used by the PowerVault ML6000 provides robust backup and restore functionality as well as effective, long-term data retention. For example, the library’s use of LTO-3 drives and WORM (write once, read many) media aids in regulatory compliance by preventing data from being overwritten or digitally altered while providing excellent tape drive performance.

The Dell PowerVault ML6000 tape library is optimized for Dell PowerEdge™ servers and Dell’s comprehensive storage portfolio. For more information, visit www.dell.com/storage.

Figure A. Dell PowerVault ML6010 CM tape library
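The sidebar's capacity and throughput figures can be reproduced from LTO-3's native (uncompressed) specifications: 400 GB per cartridge and 80 MB/s per drive. The sketch below assumes decimal units (1 TB = 1,000 GB), as tape vendors quote them. It also assumes every drive streams at full native speed. Real jobs are usually limited by the source disks and the network, so treat the backup-window estimate as a best case. The function names are illustrative only.

```python
LTO3_NATIVE_GB_PER_CARTRIDGE = 400  # LTO-3 native (uncompressed) cartridge capacity
LTO3_NATIVE_MB_PER_SECOND = 80      # LTO-3 native (uncompressed) drive transfer rate


def library_capacity_tb(slots: int) -> float:
    """Native capacity in decimal terabytes for a given number of cartridge slots."""
    return slots * LTO3_NATIVE_GB_PER_CARTRIDGE / 1000


def library_throughput_tb_per_hour(drives: int) -> float:
    """Aggregate native throughput in decimal TB/hour with every drive streaming."""
    return drives * LTO3_NATIVE_MB_PER_SECOND * 3600 / 1_000_000


def hours_to_write(data_tb: float, drives: int) -> float:
    """Best-case hours to write data_tb with all drives streaming at native speed."""
    return data_tb / library_throughput_tb_per_hour(drives)


if __name__ == "__main__":
    for slots in (36, 128, 404):
        print(slots, "slots ->", library_capacity_tb(slots), "TB native")
    for drives in (2, 6):
        print(drives, "drives ->", library_throughput_tb_per_hour(drives), "TB/hour")
    print("Full 51.2 TB library with 6 drives ->", round(hours_to_write(51.2, 6), 1), "hours")
```

Six drives at 80 MB/s come to about 1.73 TB per hour, so filling a 128-slot library takes roughly 30 hours even in the best case.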

Rule 7: Consider an active/active data center strategy

One way to know that a recovery plan will work is to make it part of everyday load balancing. Along these lines, Dell relies on an active/active data center strategy as part of its everyday operations. To do so, Dell provisions more than 100 percent of the required capacity for each application so that the application’s load can be balanced across multiple data centers. Each application has 75 percent of its required capacity in each data center, giving each application 150 percent of its nominal capacity requirement. Not only does this load-balancing strategy translate into high-performance applications, but it also helps ensure that disaster recovery and failover capabilities are being tested every moment of every day. This way, when Dell needs to implement its disaster recovery plan, the company knows it will work because it is already part of the existing load-balancing strategy.
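The capacity figures in this rule can be checked with a few lines of arithmetic. The sketch below (function names hypothetical) takes N active sites, each provisioned at some percentage of an application's nominal requirement. It computes the total provisioned capacity, the share of normal load each site carries, and what remains after one site is lost. With two sites at 75 percent each, that gives 150 percent in total, 50 percent of the load per site on a normal day, and 75 percent of nominal capacity on the surviving site. The second function shows the per-site level needed to carry the full nominal load after losing one site. It is included for comparison, not as a recommendation.

```python
def active_active_capacity(per_site_pct: float, sites: int) -> dict[str, float]:
    """Capacity figures as percentages of one application's nominal requirement.

    per_site_pct -- capacity provisioned for the application in each data center
    sites        -- number of active data centers sharing the load
    """
    if sites < 2:
        raise ValueError("an active/active strategy needs at least two sites")
    return {
        "total": per_site_pct * sites,                      # provisioned across all sites
        "normal_load_per_site": 100 / sites,                # share of nominal load per site
        "after_one_site_loss": per_site_pct * (sites - 1),  # what survives a site outage
    }


def per_site_pct_for_full_failover(sites: int) -> float:
    """Per-site provisioning needed to carry 100% of nominal load after losing one site."""
    if sites < 2:
        raise ValueError("an active/active strategy needs at least two sites")
    return 100 / (sites - 1)


if __name__ == "__main__":
    print(active_active_capacity(75, 2))
    print(per_site_pct_for_full_failover(2), per_site_pct_for_full_failover(3))
```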

Rule 8: Recognize potential vendor weaknesses

Critical vendors can significantly affect an enterprise’s capacity for disaster recovery. When putting together a business continuity plan, organizations must evaluate each critical vendor’s own disaster recovery capabilities to understand how the vendor’s potential weaknesses might hinder the enterprise. For example, after the 9/11 terrorist attacks, many customers turned to Dell to rapidly reconstitute systems that had been destroyed or damaged. Thanks to its size and scalability, Dell was able to respond quickly to get these customers back online. Had Dell been smaller and less scalable, hardware procurement might have become a significant bottleneck in its customers’ disaster recovery.

Rule 9: Keep disaster recovery capability up-to-date

Disaster recovery capability can quickly become outdated. It must be maintained by a strong set of procedures and processes so that it becomes part of the everyday, every project, and every implementation culture. As each new project or application is initiated, enterprises need to analyze where it fits in the criticality matrix. If, for instance, a new application is deemed to support a Class 1 business process, then the application must be engineered with the appropriate recoverability, and that capability must be maintained going forward. As changes are made to applications, databases, and data centers, disaster recovery capabilities should be updated as well.
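The intake step described above can be expressed as a small decision function. This is a minimal sketch. It assumes a new application can be characterized by two answers: whether it directly serves customers (the Class 1 test from Rule 3), and how long it can be down before the business is hurt (compared against the 48-hour Class 2 threshold). The function and parameter names are hypothetical, and a real criticality review would weigh more factors.

```python
from enum import IntEnum


class Tier(IntEnum):
    CLASS_1 = 1  # business-critical: directly serves customers
    CLASS_2 = 2  # business-essential: hurt by an outage of about 48 hours
    CLASS_3 = 3  # business-support: temporary loss of service is noncritical


ESSENTIAL_OUTAGE_HOURS = 48  # threshold taken from the Class 2 definition in Rule 3


def classify(customer_facing: bool, hours_until_business_impact: float) -> Tier:
    """Place a new application in the three-tier criticality matrix (hypothetical intake).

    customer_facing             -- does it take, process, build, ship, or service orders?
    hours_until_business_impact -- how long an outage can last before it hurts the business
    """
    if customer_facing:
        return Tier.CLASS_1
    if hours_until_business_impact <= ESSENTIAL_OUTAGE_HOURS:
        return Tier.CLASS_2
    return Tier.CLASS_3


if __name__ == "__main__":
    print(classify(True, 1_000).name)  # a customer-facing order service
    print(classify(False, 24).name)    # an internal process that hurts after a day
    print(classify(False, 168).name)   # a process that can wait a week
```

Re-running the same check whenever an application, database, or data center changes keeps the classification, and the recoverability it implies, current.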

Rule 10: Perform tests on a regular basis

Enterprises can never assume that their disaster recovery capability is actually working. Dell tests its failover and recovery processes on a quarterly basis. Aside from validating that Dell does indeed have failover and recovery capabilities, these quarterly tests help keep the business continuity plan in front of the infrastructure and application teams, which encourages future development with business continuity in mind. The quarterly tests are an important part of the effort to make disaster recovery part of the everyday, every project, and every implementation culture at Dell.
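A quarterly testing cadence is easy to audit mechanically. The sketch below assumes test dates are recorded as a simple list of calendar dates, a hypothetical record format. It reports every calendar quarter in a given period that has no recorded failover or recovery test.

```python
from datetime import date


def quarter_of(d: date) -> tuple[int, int]:
    """(year, quarter) for a date, with quarters numbered 1 to 4."""
    return d.year, (d.month - 1) // 3 + 1


def quarters_between(start: date, end: date) -> list[tuple[int, int]]:
    """Every (year, quarter) from start's quarter through end's quarter, inclusive."""
    year, q = quarter_of(start)
    last = quarter_of(end)
    out = []
    while (year, q) <= last:
        out.append((year, q))
        year, q = (year + 1, 1) if q == 4 else (year, q + 1)
    return out


def untested_quarters(test_dates: list[date], start: date, end: date) -> list[tuple[int, int]]:
    """Quarters in the period with no recorded failover or recovery test."""
    tested = {quarter_of(d) for d in test_dates}
    return [yq for yq in quarters_between(start, end) if yq not in tested]


if __name__ == "__main__":
    tests = [date(2006, 1, 15), date(2006, 4, 20), date(2006, 10, 5)]  # hypothetical log
    print(untested_quarters(tests, date(2006, 1, 1), date(2006, 12, 31)))  # [(2006, 3)]
```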

Align recovery efforts with business objectives

Effective disaster recovery and business continuity planning depends on an enterprise’s ability to identify critical processes and technologies, maintain and recover functionality after a planned or unplanned event, and balance the risks against the costs of continuity efforts. In turn, this effectiveness requires aligning business continuity planning with articulated business goals. To align business objectives with continuity efforts, enterprises must develop a risk profile based on hard data, and the business side of the organization should guide the development of that profile. By basing disaster recovery and business continuity efforts on business objectives and by refining these practices over time, organizations can develop a plan that not only pays dividends when disaster strikes but also helps them realize efficiencies in their day-to-day operations.

Rich Armour is a director on the Dell Information Technology team. He has a B.S. in Computer Science and Mathematics from Eastern New Mexico University and an M.B.A. from George Washington University.

Paul Eno is a senior manager on the Dell Information Technology team. He has a B.S. in Engineering from the U.S. Military Academy at West Point, New York, and an M.B.A. in Financial Information Management from The University of Texas at Austin.

Michael Kimble is an enterprise technologist in the Advanced Systems Group at Dell. He focuses on storage solutions for business continuity and disaster recovery.

Jesse Freund is a business and technology writer based in San Francisco. He has written about business and technology for leading publications, corporations, and organizations, including Business 2.0 and Wired magazines. Jesse has a B.A. in History from the University of California, Berkeley.

FOR MORE INFORMATION

Dell business continuity:
www.dell.com/disasterrecovery
www.dell.com/enterprise
www.dell.com/storage
www.dell.com/services