Disaster Recovery BSS Data Center

8/8/2019 Disaster Recovery BSS Data Center


Disaster Recovery for a

BSS Data Center


Disaster Recovery: The Lighter Side




Section 1

Disaster Recovery Overview



What is a Disaster?

A hazard that has materialized

Perceived tragedy

Natural calamity

Man-made catastrophe

Disasters are the consequence of

inappropriately managed risks




Risks to be Addressed




What is Disaster Recovery from an IT Perspective?

Timely and effective restoration of IT services

in a major incident

Any plan or set of procedures implemented by

a business to maintain uptime and/or prevent

data loss in the event of a system failure




Disaster Recovery

People

Staff, Outsourced

Process

Crisis Management

Technology

Hardware, Software


IT



Metrics for Disaster Recovery (1/2)

Driven by two metrics

Recovery Time Objective (RTO)

Interrupted for how long?

Recovery Point Objective (RPO)

How much data loss?




Metrics for Disaster Recovery (2/2)

[Figure: timeline from 5 a.m. to 7 p.m. with the disaster declared at 10 a.m., marking the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) windows]

RPO: Amount of data lost from a failure, measured as the amount of time preceding the disaster event

RTO: Targeted amount of time to restart a business service after a disaster event
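
As a rough illustration of the two definitions above, the following Python sketch measures the data-loss window and the outage duration for one incident and compares them with targets. Only the 10 a.m. declaration comes from the timeline; the other timestamps, the target values and the function name recovery_metrics are assumptions made for the example.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recoverable_copy, disaster_time, service_restored):
    """Return (data-loss window, outage duration) for one incident."""
    data_loss_window = disaster_time - last_recoverable_copy  # compare with the RPO
    outage_duration = service_restored - disaster_time        # compare with the RTO
    return data_loss_window, outage_duration


# Hypothetical incident: only the 10 a.m. declaration is taken from the timeline.
disaster = datetime(2019, 8, 8, 10, 0)
last_copy = datetime(2019, 8, 8, 8, 0)   # assumed last replicated or backed-up point
restored = datetime(2019, 8, 8, 15, 0)   # assumed time the service was back

loss, outage = recovery_metrics(last_copy, disaster, restored)

rpo_target = timedelta(hours=4)  # assumed objective
rto_target = timedelta(hours=6)  # assumed objective

print("data-loss window:", loss, "| within RPO:", loss <= rpo_target)
print("outage duration:", outage, "| within RTO:", outage <= rto_target)
```

The point of the sketch is that RPO looks backwards from the disaster (how old is the newest recoverable data) while RTO looks forwards (how long until service resumes).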




Understanding RPO and RTO

Cost of downtime per hour

Employee cost per hour + Cost of problem repair + Cost of employee overtime

Loss of customer

Reputation of Company

Recovery Point Objective (RPO)

A point in time to which the data must be recovered

An acceptable loss of data during disaster situation

Recovery Time Objective (RTO)

The duration of time within which a business process must be restored after a disaster (underlying infrastructure and application components are restored first)
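
The cost-of-downtime formula on this slide can be sketched in Python as below. This is only an illustration of the arithmetic: the rupee amounts are invented, the function names are hypothetical, and loss of customers and reputation are left out because the slide does not quantify them.

```python
def downtime_cost_per_hour(employee_cost, repair_cost, overtime_cost):
    """Hourly cost of downtime as the sum of the three terms on the slide."""
    return employee_cost + repair_cost + overtime_cost


def outage_cost(hours_down, employee_cost, repair_cost, overtime_cost):
    """Direct cost of an outage; customer loss and reputation are not priced in."""
    return hours_down * downtime_cost_per_hour(employee_cost, repair_cost, overtime_cost)


# Hypothetical hourly figures in INR, chosen only to show the arithmetic.
hourly = downtime_cost_per_hour(employee_cost=200_000, repair_cost=50_000, overtime_cost=30_000)
print("cost per hour of downtime:", hourly)
print("cost of a 5-hour outage:", outage_cost(5, 200_000, 50_000, 30_000))
```

Multiplying the hourly figure by a candidate RTO gives a first estimate of what a given recovery target costs the business per incident.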




Investment Scenario




High Availability vs. Disaster Tolerance

High Availability

Providing redundancy within a data center to maintain the service (with or without a short outage)

Hardware failures

Software failures

Human error

Disaster Tolerance

Providing redundancy between data centers to restore the service quickly (tens of minutes) after certain disasters (dedicated equipment)

Power loss

Fire, flood, earthquake

Sabotage, terrorism




Availability Events (1/2)

Planned Outages

Network and power related changes

Hardware repair

Hardware and/or software upgrades

Software maintenance

OS

Database

Applications

Data backup and storage management

As data grows in size, tape backup is less effective

What data must be archived?

How is the data archived?




What causes the most Downtime?


Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008



Measure of Availability


Hours of downtime per year per IT service

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
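
The measure named on this slide, hours of downtime per year, follows directly from an availability percentage. The Python sketch below shows the conversion; the availability levels are generic "nines" assumed for illustration, not figures taken from the Gartner chart.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours per year a service is down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Assumed availability levels, for illustration only.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% available -> {downtime_hours_per_year(pct):.2f} hours down per year")
```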




Section 2

Architecture & Sizing for Disaster Recovery



2-Site Architecture

100% Primary Site + 100% DR Site

Database changes are frequent, hence log-based replication of the database between the Primary and DR sites.

Synchronous replication is not possible because of WAN bandwidth

Asynchronous replication is possible

RPO -> Depends on how much data is to be replicated

RTO -> Depends upon People + Processes
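
To make the point about RPO concrete, here is a minimal Python sketch of asynchronous log shipping in which the link has a fixed capacity per minute. The log volumes, the link capacity and the function name replication_lag are all assumptions; a real replication product will behave differently in detail.

```python
from collections import deque


def replication_lag(log_mb_per_minute, link_mb_per_minute):
    """Simulate log shipping minute by minute.

    Returns, for each minute, how many minutes of changes have not yet
    reached the DR site (the data that would be lost if the primary failed).
    """
    pending = deque()  # entries are [minute generated, MB still to ship]
    lags = []
    for minute, generated in enumerate(log_mb_per_minute):
        if generated > 0:
            pending.append([minute, generated])
        capacity = link_mb_per_minute
        while pending and capacity > 0:
            shipped = min(capacity, pending[0][1])
            pending[0][1] -= shipped
            capacity -= shipped
            if pending[0][1] == 0:
                pending.popleft()
        # Age of the oldest log that is still waiting to be shipped.
        lags.append(minute - pending[0][0] + 1 if pending else 0)
    return lags


# Hypothetical workload: a two-minute burst of 200 MB/min on a 100 MB/min link.
lags = replication_lag([50, 50, 200, 200, 50, 50, 50, 50], link_mb_per_minute=100)
print(lags)  # [0, 0, 1, 1, 2, 2, 1, 0]
```

The lag grows while the change rate exceeds the link capacity and shrinks once the backlog drains, which is why the achievable RPO depends on the volume of data to be replicated.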




2-Site Architecture: Working




[Figure: 2-site architecture. The Primary Site (ACTIVE) and the DR Site each have an application tier (application servers), a DB tier (DBCI servers in a cluster) and a storage tier (SAN storage with volume groups for application files and archive logs). The sites are connected by dark fiber and use asynchronous replication.]




2-Site Architecture

Advantages

Simple to manage

Less expensive than other solutions

Only one link needs to be procured

Disadvantages

RPO of 15 minutes is not quantifiable (impact could be high or low)

Cannot estimate what kind of data loss will happen

RTO for the DR site cannot be quantified for the business because of lost transactions.




3 Site Architecture (for RPO=0)




For RPO=0, must have synchronous replication of the database

Synchronous replication has limitations on distance (40 to 60 km)

Hence cannot replicate synchronously over long distances

But can replicate over short distances

So a 3 Site (Primary, Near, DR) solution might achieve RPO=0 (almost)
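
One reason for the distance limit is propagation delay: every synchronous write must wait for the remote site to acknowledge it. The Python sketch below estimates that added delay, assuming light travels about 200 km per millisecond in optical fibre and ignoring switch, protocol and storage overheads. The round_trips parameter is a hypothetical knob, since the number of round trips per write depends on the protocol, and the 500 km and 1500 km distances are assumed for comparison.

```python
FIBRE_KM_PER_MS = 200.0  # approximate speed of light in optical fibre


def sync_write_penalty_ms(distance_km, round_trips=1):
    """Extra latency per write from waiting for the remote acknowledgement."""
    return round_trips * 2 * distance_km / FIBRE_KM_PER_MS


# 40 and 60 km are the limits quoted above; 500 and 1500 km are assumed.
for km in (40, 60, 500, 1500):
    print(f"{km:5d} km -> about {sync_write_penalty_ms(km):.2f} ms added per write")
```

At metro distances the penalty stays well under a millisecond per round trip, while at long-haul distances it is added to every committed write, which is why a nearby site is used for the synchronous copy.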




In which cases will RPO be zero?

Regional disasters which don't destroy the Primary and Near sites at the same time.

For all kinds of DC failures, RPO=0 can be achieved

In case of a regional disaster which wipes out both the Primary and Near sites, RPO will depend upon the link between Primary and DR (could be 15 minutes, depending upon the size of the link)
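
The cases above can be summarised as a small decision function. This Python sketch is only a restatement of the slide's reasoning: the function name expected_rpo is hypothetical, and the 15-minute default is the example figure quoted above, not a guaranteed value.

```python
from datetime import timedelta


def expected_rpo(primary_lost, near_lost, async_lag=timedelta(minutes=15)):
    """Expected data loss for a 3-site setup under a given failure scenario."""
    if not primary_lost:
        return timedelta(0)  # production keeps running, nothing is lost
    if not near_lost:
        return timedelta(0)  # the synchronous copy at the Near site is complete
    return async_lag         # only the asynchronous copy at the DR site survives


print(expected_rpo(primary_lost=True, near_lost=False))
print(expected_rpo(primary_lost=True, near_lost=True))
```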




[Figure: 3-site architecture. The Primary Site, Near Site and DR Site each have application servers, DBCI servers (in a cluster at two of the sites) and SAN storage with volume groups for application files and archive logs. The sites are linked by synchronous replication, asynchronous replication and a WAN link.]




3 Site Architecture: Working


[Figure: Site A and Site B (PROD and Near/Bunker), less than 25 km apart and connected by dark fibre, and Site C (DR)]



3 Site DR considerations

What must a Near site have?

Different and multiple power sources / power grids

Network termination exactly the same as the Primary DC (if the Near site has to be used for Primary site operations)

Replication links from multiple vendors (No SPOF)

Link to DR site




What should be in the Near Site?

Option 1: Full 100% Replica of the Primary Site

High cost (Infrastructure + People)

Servers, storage, firewalls, switches, backup, power sources

Applications, databases, etc.

Security, personnel, processes

Network connectivity

Would protect against any local problems at the Primary DC





Option 2: Split Configuration between Primary and Near Site

Database servers split between Primary and Near Site (extended cluster)

When the Primary DC fails, operations move to the Near Site

Maintenance and continuous upkeep of the Near Site essential

Redundancy required for application servers, firewalls, routers, servers, backup, etc.





Option 3: Minimalist

Treat the Near site only for RPO=0 purposes and not for operations

Replicate storage continuously for RPO=0

Keep only the hardware which can push data from the Near site to DR in case of Primary DC failure.

Keeps the simplicity of 2 Site DR with the RPO=0 of 3 Site

RPO=0 not achieved if the Primary and Near Sites go down together





Section 3

Connectivity to DR Site



Connectivity

The majority of businesses deploy wide area networks (WANs) to

connect the remote parts of the business back to centralized

resources

Bandwidth is always an issue in disaster recovery. If you're replicating data for potential failover both locally and remotely, then your bandwidth issues become more complicated.

We want to establish a DR site that's far enough away that it won't

be affected by the same disaster, but not so far away that WAN

bandwidth costs will be prohibitive.




The physical distance involved will often dictate the type of replication

used to move data between sites.

There are two types of replication:

1) Synchronous replication

2) Asynchronous replication

Synchronous replication moves data in real time so that the data center and DR site contain the same data moment to moment, but synchronous data transfers often need high-bandwidth connections.

Asynchronous replication moves data on a bandwidth-available basis.

This allows data movement using cheaper, lower-bandwidth connections,

but presents a possibility of data loss because the data center and DR

site may be out of sync by up to several hours
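
The difference between the two modes can be shown with a toy Python model. This is a sketch under simplifying assumptions, not how any particular replication product works: writes are list entries, the link never fails, and the class name ReplicatedVolume and its methods are hypothetical.

```python
class ReplicatedVolume:
    """Toy model of a primary volume with one remote copy."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []    # writes stored at the data center
        self.remote = []     # writes stored at the DR site
        self.in_flight = []  # writes acknowledged locally but not yet sent

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            # Acknowledged only after the remote copy also has the write.
            self.remote.append(block)
        else:
            # Acknowledged immediately; shipped when bandwidth is available.
            self.in_flight.append(block)

    def drain(self, max_blocks):
        """Ship up to max_blocks queued writes to the DR site (async only)."""
        batch = self.in_flight[:max_blocks]
        self.in_flight = self.in_flight[max_blocks:]
        self.remote.extend(batch)

    def lost_if_primary_fails_now(self):
        return len(self.primary) - len(self.remote)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in range(10):
    sync_vol.write(block)
    async_vol.write(block)
async_vol.drain(max_blocks=6)  # assumed: the link only had room for 6 writes

print("sync writes at risk:", sync_vol.lost_if_primary_fails_now())
print("async writes at risk:", async_vol.lost_if_primary_fails_now())
```

In the synchronous case nothing is at risk but every write pays for the remote round trip; in the asynchronous case writes return immediately and whatever is still queued is the potential data loss.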




With the popularity of IP connectivity, many connectivity options are available. Connectivity on a SAN can be provided in several ways:

Ethernet

FC (Fibre Channel)

iSCSI (Internet Small Computer System Interface)

FCIP

FCoE (Fibre Channel over Ethernet)

The sites can be connected by a VPN, which provides cost benefits

1) Ethernet

Traditional Ethernet ports support 10/100 Mbps -- far slower than Fibre

Channel. Ethernet bandwidth is increasing today and 10 Gigabit Ethernet

(10GigE) is widely available for data centers

2) Fibre Channel

Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available and 10 Gbps implementations are appearing on some high-end systems and director-class switches.




3) iSCSI (Internet Small Computer System Interface)

iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances.

iSCSI encapsulates SCSI commands into IP packets for transmission over an Ethernet connection, rather than a Fibre Channel connection.

iSCSI still has two disadvantages for storage:

At 1 GigE, it does not perform as fast as Fibre Channel.

Ethernet will drop packets during network congestion.

These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet

4) FCIP

FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged

between distant Fibre Channel SANs. It's important to note that FCIP only works to connect

Fibre Channel SANs, but iSCSI can run on any Ethernet network.

5) FCoE

Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable

SAN and LAN convergence
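
To compare the nominal port speeds quoted in the list above, the following Python sketch estimates how long it would take to move a fixed amount of data at each speed. It ignores protocol overhead, latency and contention, so real transfers will be slower; the 1 TB data size and the function name transfer_hours are assumptions.

```python
def transfer_hours(data_gb, link_mbps):
    """Hours needed to move data_gb (decimal gigabytes) at link_mbps."""
    bits = data_gb * 8 * 10**9
    seconds = bits / (link_mbps * 10**6)
    return seconds / 3600


# Nominal speeds taken from the list above, in Mbps.
links_mbps = {
    "100 Mbps Ethernet": 100,
    "1 Gbps FC": 1_000,
    "2 Gbps FC": 2_000,
    "4 Gbps FC": 4_000,
    "10 GigE / 10 Gbps FC": 10_000,
}
data_gb = 1_000  # assumed: 1 TB of data to replicate

for name, mbps in links_mbps.items():
    print(f"{name:22s} {transfer_hours(data_gb, mbps):6.2f} h")
```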




Requirements

To establish WAN connectivity between the Central Location and 2 remote locations for the Data Transfer Application.

The leased-line-based network design is primarily to be used for implementing the Online Data Transfer Application with automatic ISDN backup connectivity.

The connectivity from the Central Location to the remote locations to be at 64 Kbps to 2 Mbps speed.

The connectivity to be always on.

The Network Devices to be SNMP managed.

Provision for future scalability.




DAX Network

Central Location:

At the Central location, Dax recommended that the customer opt for one DX-2650 Modular Access Router with one 10/100 port, 4 NM slots and VoIP module support.

The router was populated as follows:

Slot 1 - 2-port Sync/Async Serial Module (speed up to 2 Mbps)

Slot 2 - 4-port ISDN U module.

Remaining 2 slots were left free for future scalability.

Remote Location:

At the Remote location, Dax recommended that each remote branch use a DX-1721 Modular Router with one 10/100 port and 4 WAN slots for WAN/VoIP modules.

Each DX-1721 was loaded with the following modules:

Slot 1 - ISDN S/T module for providing automatic backup connectivity.

Slot 2 - 1-port high-speed Serial Sync/Async WAN Interface module for connecting the leased line link at 64 Kbps up to 2 Mbps speed.

The remaining 2 slots were left free for future scalability.





Section 4

Backup Solution



Possible Options

Backup and recovery from tape

Host-based replication

Storage-based replication

Data replication infrastructure

Replicating databases

A comparison of the various disaster recovery

solutions

Metro clusters




Backup And Recovery From Tape

RAID technology is used to provide high levels of data availability

RAID cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted

The tapes can be cloned, i.e., copied to new media to allow them to be stored off-site in a disaster recovery location

Least expensive of all the options

It is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e. services whose RPOs and RTOs tolerate data loss and longer recovery times




Host-based replication

The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware RAID protected LUNs

It then forwards these writes on to one or more remote Solaris OS-based nodes connected through an IP-based network

2 modes of data transfer: synchronous mode replication and asynchronous mode replication




Storage-Based Data Replication

Perform data replication on the CPUs or controllers resident in the storage systems.

2 modes, synchronous and asynchronous, but the software operates at a much lower level.

Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC even though the I/Os to a single LUN might be issued by several nodes concurrently.

The software provides remote replication through disk-based journaling.

Journaling techniques can improve levels of reliability and robustness in remote copying operations, thereby also providing better data recovery capabilities.




Data replication infrastructure




Replicating databases

Relational database management system (RDBMS) portfolios from IBM and Oracle include a wide range of tools to manage and administer data held in their respective databases: DB2 and Oracle

The RDBMS software is designed to handle logical changes to the underlying data

So it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution.




Metro Clusters

The ability to cluster systems across hundreds of kilometers using Dense Wavelength Division Multiplexers (DWDM) and SAN-connected Fibre Channel storage devices

Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster and storage between two widely separated data centers

The physically separated cluster nodes work identically but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment





Section 5

Costing



Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.

Building the business case requires a different approach that calculates the cost of downtime, defines specific requirements, identifies realistic risks, selects cost-effective technologies and services, and shows a commitment to disaster recovery planning and preparedness as an ongoing program.




SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING

Implement a continuity management process.

Conduct a business impact analysis (BIA) and risk assessment.

Calculate the cost of downtime.

Develop impact scenarios that address all risks, not just

disasters.

Position DR as a competitive necessity.

Develop a DR services catalog.

Align DR technology investments with other IT initiatives.




Item | Assumption | Qty | Unit Price (INR) | Cost (INR crores)

Capex
DC site | 33% of space in sqft | 20,000 | 25,000 | 50
Servers | 33% of CPUs | 2,000 | 500,000 | 100
Storage | 33% of storage in TB | 2,000 | 400,000 | 80
Network | 10% of server cost | 10
Software | 15% of storage cost | 12
Implementation - Consulting | 10% of Capex | 20
Total | 272

Opex
Bandwidth | 100,000 | 50
Power | Rs. 50,000 per kW per annum, 6 kW per rack | 600 | 300,000 | 18
Manpower | 6 NOC seats, 20 on-site | 10
AMC | 6% of Capex | 15
Total | 93
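
The arithmetic behind the table can be checked with a short Python sketch. Rows that list a quantity and a unit price are recomputed (1 crore = 10,000,000 rupees), Network and Software are derived from the stated percentages, and the remaining rows are taken as stated because the table does not give enough detail to recompute them. The dictionary layout and the function name crores are assumptions made for the example.

```python
CRORE = 10_000_000  # 1 crore = 10 million rupees


def crores(quantity, unit_price_inr):
    """Quantity x unit price, expressed in INR crores."""
    return quantity * unit_price_inr / CRORE


capex = {
    "DC site": crores(20_000, 25_000),
    "Servers": crores(2_000, 500_000),
    "Storage": crores(2_000, 400_000),
}
capex["Network"] = capex["Servers"] * 10 / 100   # 10% of server cost
capex["Software"] = capex["Storage"] * 15 / 100  # 15% of storage cost
capex["Implementation - Consulting"] = 20        # taken as stated in the table

opex = {
    "Bandwidth": 50,                # taken as stated in the table
    "Power": crores(600, 300_000),  # Rs. 50,000 per kW x 6 kW = 300,000 per unit
    "Manpower": 10,                 # taken as stated in the table
    "AMC": 15,                      # taken as stated in the table
}

print("Capex total (INR crores):", sum(capex.values()))
print("Opex total (INR crores):", sum(opex.values()))
```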




Thank You
