deepak-bhandari
8/8/2019 Disaster Recovery BSS Data Center
Disaster Recovery for a
BSS Data Center
1DR for a BSS Data Centre
Disaster Recovery: The Lighter Side
Section 1
Disaster Recovery Overview
What is a Disaster?
A hazard that has been realized
Perceived tragedy
Natural calamity
Man-made catastrophe
Disasters are the consequence of
inappropriately managed risks
Risks to be Addressed
What is Disaster Recovery from an IT Perspective?
Timely and effective restoration of IT services after a major incident
Any plan or set of procedures implemented by
a business to maintain uptime and/or prevent
data loss in the event of a system failure
Disaster Recovery
People
Staff, Outsourced
Process
Crisis Management
Technology
Hardware, Software
IT
Metrics for Disaster Recovery (1/2)
Driven by two metrics
Recovery Time Objective (RTO)
Interrupted for how long?
Recovery Point Objective (RPO)
How much data loss?
Metrics for Disaster Recovery (2/2)
RPO: Amount of data lost from failure, measured as the amount of time from a disaster event
RTO: Targeted amount of time to restart a business service after a disaster event
[Figure: timeline from 5 a.m. to 7 p.m. with the disaster declared at 10 a.m.; spans labelled Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO)]
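The RPO and RTO definitions on this slide reduce to arithmetic on timestamps. The following is a minimal sketch, assuming a hypothetical incident: only the 10 a.m. disaster declaration comes from the slide, while the 9 a.m. last copy, the 1 p.m. restoration, the targets, the calendar date and the function names are illustrative assumptions.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Time between the last recoverable copy and the disaster (compare with RPO)."""
    return disaster - last_recovery_point


def outage_duration(disaster: datetime, service_restored: datetime) -> timedelta:
    """Time between the disaster and service restart (compare with RTO)."""
    return service_restored - disaster


# Hypothetical incident; the date is arbitrary.
last_copy = datetime(2019, 8, 8, 9, 0)    # assumed last replicated point
declared = datetime(2019, 8, 8, 10, 0)    # disaster declared at 10 a.m. (from the slide)
restored = datetime(2019, 8, 8, 13, 0)    # assumed time service resumed

loss = data_loss_window(last_copy, declared)
outage = outage_duration(declared, restored)

# Assumed targets, for illustration only.
rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=4)

print("data loss window:", loss, "| within RPO:", loss <= rpo_target)
print("outage duration:", outage, "| within RTO:", outage <= rto_target)
```

In this made-up case the outage fits the assumed RTO, but one hour of lost data exceeds the assumed 15-minute RPO.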
Understanding RPO and RTO
Cost of downtime per hour
Employee cost per hour + Cost of problem repair + Cost of employee overtime
Loss of customers
Reputation of the company
Recovery Point Objective (RPO)
A point in time to which the data must be recovered
An acceptable loss of data during a disaster situation
Recovery Time Objective (RTO)
The duration of time within which a business process must be restored after a disaster (underlying infrastructure and application components are restored first)
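The cost-of-downtime sum on this slide can be computed directly. This is a minimal sketch: the function names and all rupee figures are hypothetical, and the unquantified items on the slide (loss of customers, company reputation) are deliberately left out.

```python
def downtime_cost_per_hour(employee_cost: float,
                           repair_cost: float,
                           overtime_cost: float) -> float:
    """Employee cost per hour + cost of problem repair + cost of employee overtime."""
    return employee_cost + repair_cost + overtime_cost


def outage_cost(cost_per_hour: float, outage_hours: float) -> float:
    """Total direct cost of an outage of the given length."""
    return cost_per_hour * outage_hours


# Hypothetical hourly figures in INR, for illustration only.
hourly = downtime_cost_per_hour(employee_cost=200_000,
                                repair_cost=50_000,
                                overtime_cost=30_000)
print("cost per hour (INR):", hourly)
print("cost of a 4-hour outage (INR):", outage_cost(hourly, 4))
```

Comparing this figure against the cost of a tighter RTO is one way to frame the investment decision.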
Investment Scenario
High Availability vs. Disaster Tolerance
High Availability
Providing redundancy within a data center to maintain the service (with or without a short outage)
Hardware failures
Software failures
Human error
Disaster Tolerance
Providing redundancy between data centers to restore the service quickly (tens of minutes) after certain disasters (dedicated equipment)
Power loss
Fire, flood, earthquake
Sabotage, terrorism
Availability Events (1/2)
Planned Outages
Network and power related changes
Hardware repair
Hardware and/or software upgrades
Software maintenance
OS
Database
Applications
Data backup and storage management
As data grows in size, tape backup is less effective
What data must be archived?
How is the data archived?
What causes the most Downtime?
Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
Measure of Availability
[Chart: hours of downtime per year per IT service]
Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
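Availability is commonly quoted as a percentage, which converts to hours of downtime per year. The sketch below shows the conversion; the percentages are generic examples and are not the values from the chart, which are not recoverable here.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service is unavailable at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


# Example availability levels (assumed, not from the chart).
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> {downtime_hours_per_year(pct):.2f} hours down per year")
```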
Section 2
Architecture & Sizing for Disaster Recovery
2-Site Architecture
100% Primary Site + 100% DR Site
Database changes are more frequent, hence log-based replication of the database between the Primary and DR sites
Synchronous replication is not possible because of WAN bandwidth
Asynchronous replication is possible
RPO -> Depends on how much data is to be replicated
RTO -> Depends upon People + Processes
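With asynchronous replication, the data exposed to loss is whatever has been written at the Primary site but not yet shipped. A rough sketch of that dependency follows, assuming a burst of changes that temporarily exceeds the link rate; the rates, the burst length and the function names are hypothetical, and protocol overhead is ignored.

```python
def backlog_megabits(change_rate_mbps: float, link_mbps: float,
                     burst_seconds: float) -> float:
    """Unreplicated data left after a burst of changes faster than the link."""
    excess_mbps = max(change_rate_mbps - link_mbps, 0.0)
    return excess_mbps * burst_seconds


def drain_seconds(backlog: float, link_mbps: float) -> float:
    """Time to ship the backlog once the burst ends (no new changes assumed)."""
    if link_mbps <= 0:
        raise ValueError("link rate must be positive")
    return backlog / link_mbps


# Hypothetical: 150 Mbps of log changes for 10 minutes over a 100 Mbps link.
backlog = backlog_megabits(change_rate_mbps=150, link_mbps=100, burst_seconds=600)
print("backlog (megabits):", backlog)
print("time to catch up (s):", drain_seconds(backlog, link_mbps=100))
```

If the Primary site fails before the backlog drains, the backlog is the data lost, which is why a larger link tends to mean a smaller RPO.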
2-Site Architecture: Working
[Diagram: Primary Site (ACTIVE) and DR Site, each with an Application Tier (application servers), a DB Tier (DBCI servers in cluster) and a Storage Tier (SAN storage with volume groups: application files VG, archive logs VG). Labels: Asynchronous Replication, Dark Fiber.]
2-Site Architecture
Advantages
Simple to manage
Less expensive than other solutions
Only one link needs to be procured
Disadvantages
The business impact of a 15-minute RPO cannot be quantified (it could be high or low)
Cannot estimate what kind of data loss will happen
RTO for the DR site cannot be quantified for the business because of lost transactions
3 Site Architecture (for RPO=0)
For RPO=0
Must have synchronous replication of the database
Synchronous replication has limitations on distance (40 to 60 km)
Hence cannot replicate synchronously over long distances
But can replicate over short distances
So a 3-Site (Primary, Near, DR) solution might achieve RPO=0 (almost)
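One reason synchronous replication is distance-limited is that every write waits for a round trip to the remote site. The sketch below estimates only the propagation delay, assuming light travels at roughly 200,000 km/s in optical fibre (a common approximation, not a figure from the slides); equipment and protocol delays come on top, and the function name is hypothetical.

```python
FIBRE_KM_PER_SECOND = 200_000.0  # assumed approximate speed of light in fibre


def sync_round_trip_ms(distance_km: float) -> float:
    """Minimum delay added to each synchronous write by one fibre round trip."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    return 2 * distance_km / FIBRE_KM_PER_SECOND * 1000


# 50 km sits inside the 40 to 60 km range on the slide; 1,000 km is an assumed
# long-haul distance for contrast.
for km in (50, 1000):
    print(f"{km} km -> at least {sync_round_trip_ms(km):.2f} ms per write")
```

The per-write delay grows linearly with distance, so a nearby site can absorb it while a distant DR site generally cannot.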
In what cases will RPO be zero?
Regional disasters which don't destroy the Primary and Near sites at the same time
For all kinds of DC failures, RPO=0 can be achieved
In case of a regional disaster which wipes out both the Primary and Near sites, RPO will depend upon the link between Primary and DR (could be 15 minutes, depending upon the size of the link)
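The cases on this slide can be written as a small decision function. This is a sketch of the slide's reasoning rather than a measured result: the function name is hypothetical, and the 15-minute default is the slide's "could be 15 minutes" figure for the asynchronous link.

```python
def expected_rpo_minutes(primary_lost: bool, near_lost: bool,
                         async_lag_minutes: float = 15.0) -> float:
    """Data-loss window for a 3-site setup under the slide's assumptions."""
    if not primary_lost:
        return 0.0  # production data is intact, nothing to recover
    if not near_lost:
        return 0.0  # the Near site holds a synchronous copy
    return async_lag_minutes  # only the asynchronous DR copy survives


print("Primary DC failure only:", expected_rpo_minutes(True, False))
print("Regional disaster (Primary + Near):", expected_rpo_minutes(True, True))
```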
[Diagram: Primary Site, Near Site and DR Site. Each site has application servers, DBCI servers (in cluster at two of the sites) and SAN storage with volume groups (application files VG, archive logs VG). Labels: Synchronous Replication, Asynchronous Replication, WAN link.]
3 SITE ARCHITECTURE: Working
[Diagram labels: Site A, Site B, Site C; PROD, Near/Bunker, DR; Dark Fibre; Distance < 25 km]
3 Site DR considerations
What must a Near site have?
Different & multiple power source/ power grid
Network Termination exactly same as Primary DC
(if Near site has to be used for Primary site
operations)
Replication links from multiple vendors (No SPOF)
Link to DR site
What should be in the Near Site?
Option 1: Full 100% Replica of the Primary Site
High cost (Infrastructure + People)
Servers, storage, firewalls,
switches, backup, power
sources
Applications, Databases, etc
Security, Personnel, Processes
Network Connectivity
Would protect against any
local problems at Primary DC
What should be in the Near Site?
Option 2: Split Configuration between Primary and Near Site
Database servers split between Primary and Near Site (extended cluster)
When the Primary DC fails, operations move to the Near Site
Maintenance and continuous upkeep of the Near Site essential
Redundancy required for Application Servers, Firewall, Routers, Servers, Backup, etc.
What should be in the Near Site?
Option 3: Minimalist
Treat the Near site only for RPO=0 purposes and not for operations
Replicate storage continuously for RPO=0
Keep only the hardware that can push data from the Near site to DR in case of Primary DC failure
Keeps the simplicity of 2-Site DR with the RPO=0 of 3-Site
RPO=0 not achieved if the Primary and Near Sites go down together
Section 3
Connectivity to DR Site
Connectivity
The majority of businesses deploy wide area networks (WANs) to
connect the remote parts of the business back to centralized
resources
Bandwidth is always an issue in disaster recovery. If you're replicating data for potential failover, both locally and remotely,
then your bandwidth issues become more complicated.
We want to establish a DR site that's far enough away that it won't
be affected by the same disaster, but not so far away that WAN
bandwidth costs will be prohibitive.
The physical distance involved will often dictate the type of replication
used to move data between sites.
There are two types of replication:
1) Synchronous replication
2) Asynchronous replication
Synchronous replication moves data in real time so that the data center and DR site contain the same data moment to moment, but synchronous data transfers often need high-bandwidth connections.
Asynchronous replication moves data on a bandwidth-available basis. This allows data movement using cheaper, lower-bandwidth connections, but presents a possibility of data loss because the data center and DR site may be out of sync by up to several hours.
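The difference between the two replication types can be illustrated with a toy in-memory model. This is a sketch for intuition only: the class names are hypothetical, sites are plain dictionaries, and no real network or storage product is modelled.

```python
from collections import deque


class Site:
    """A storage site modelled as a simple key-value store."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, value):
        self.blocks[key] = value


class SyncReplicator:
    """Acknowledges a write only after both sites hold it."""

    def __init__(self, primary, remote):
        self.primary, self.remote = primary, remote

    def write(self, key, value):
        self.primary.write(key, value)
        self.remote.write(key, value)  # a real system would wait on the link here


class AsyncReplicator:
    """Acknowledges after the local write; ships to the remote site later."""

    def __init__(self, primary, remote):
        self.primary, self.remote = primary, remote
        self.pending = deque()

    def write(self, key, value):
        self.primary.write(key, value)
        self.pending.append((key, value))

    def ship(self, max_writes):
        """Send up to max_writes queued writes, as bandwidth allows."""
        for _ in range(min(max_writes, len(self.pending))):
            key, value = self.pending.popleft()
            self.remote.write(key, value)


sync = SyncReplicator(Site(), Site())
sync.write("order-1", "paid")

async_rep = AsyncReplicator(Site(), Site())
for i in range(3):
    async_rep.write(f"order-{i}", "paid")
async_rep.ship(max_writes=1)

# Writes still queued would be lost if the primary site failed now.
print("sync remote copy complete:", sync.remote.blocks == sync.primary.blocks)
print("async writes at risk:", len(async_rep.pending))
```

The queue in `AsyncReplicator` is the out-of-sync window described above; the synchronous variant has no such queue but pays for the remote write on every operation.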
With the popularity of IP connectivity, many connectivity options are available. SAN connectivity can be provided by options such as:
Ethernet
FC (Fibre Channel)
iSCSI (Internet Small Computer System Interface)
FCIP
FCoE (Fibre Channel over Ethernet)
The sites can be connected by a VPN, which provides cost benefits
1) Ethernet
Traditional Ethernet ports support 10/100 Mbps -- far slower than Fibre
Channel. Ethernet bandwidth is increasing today and 10 Gigabit Ethernet
(10GigE) is widely available for data centers
2) Fibre Channel
Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available and 10 Gbps implementations are appearing on some high-end systems and director-class switches.
3) iSCSI (Internet Small Computer System Interface)
iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances.
The emergence of iSCSI eases these challenges by encapsulating SCSI commands into IP packets for transmission over an Ethernet connection, rather than a Fibre Channel connection.
iSCSI still has two disadvantages for storage:-
At 1 GigE, it does not perform as fast as Fibre Channel.
And Ethernet will drop packets during network congestion.
These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet.
4) FCIP
FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged
between distant Fibre Channel SANs. It's important to note that FCIP only works to connect
Fibre Channel SANs, but iSCSI can run on any Ethernet network.
5) FCoE
Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable
SAN and LAN convergence
Requirements
To establish WAN connectivity between the Central Location and 2 remote locations for the Data Transfer Application.
The leased-line-based network design is primarily to be used for implementing the Online Data Transfer Application, with automatic ISDN backup connectivity.
The connectivity from the Central Location to the remote locations to be at 64 Kbps to 2 Mbps speed.
The connectivity to be always on.
The Network Devices to be SNMP managed.
Provision for future scalability.
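The 64 Kbps to 2 Mbps range in these requirements has a large effect on how long a transfer takes. A minimal sketch follows, assuming a hypothetical 1 GB transfer (10^9 bytes), decimal units (1 Kbps = 1,000 bits per second) and no protocol overhead; the function name is illustrative.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move size_bytes over a link, ignoring overhead."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / link_bits_per_second / 3600


ONE_GB = 1_000_000_000  # assumed transfer size, for illustration

for label, bps in (("64 Kbps", 64_000), ("2 Mbps", 2_000_000)):
    print(f"1 GB over {label}: {transfer_hours(ONE_GB, bps):.1f} hours")
```

The same function can be used to check whether a given daily change volume fits within the chosen leased-line speed.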
DAX Network
Central Location:
At the Central location, Dax recommended that the customer opt for one DX-2650 Modular Access Router with one 10/100 port, 4 NM slots and VoIP module support.
The router was populated as follows:
Slot 1 - 2-port Sync/Async Serial Module (speed up to 2 Mbps)
Slot 2 - 4-port ISDN U module
The remaining 2 slots were left free for future scalability.
Remote Location:
At the Remote location, Dax recommended that each remote branch use a DX-1721 Modular Router with one 10/100 port and 4 WAN slots for WAN/VoIP modules.
Each DX-1721 was loaded with the following modules:
Slot 1 - ISDN S/T module for providing automatic backup connectivity
Slot 2 - 1-port high-speed Serial Sync/Async WAN Interface module for connecting the leased line link at 64 Kbps up to 2 Mbps
The remaining 2 slots were left free for future scalability.
Section 4
Backup Solution
Possible Options
Backup and recovery from tape
Host-based replication
Storage-based replication
Data replication infrastructure
Replicating databases
A comparison of the various disaster recovery
solutions
Metro clusters
Backup And Recovery From Tape
RAID technology is used to provide high levels of data availability
RAID cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted
The tapes can be cloned, i.e., copied to new media, to allow them to be stored off-site in a disaster recovery location
Least expensive of all the options
It is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e. services for which some data loss (RPO) and longer RTOs are acceptable
Host-based replication
The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware RAID-protected LUNs
It then forwards these writes on to one or more remote Solaris OS-based nodes connected through an IP-based network
2 modes of data transfer: synchronous mode replication and asynchronous mode replication
Storage-Based Data Replication
Performs data replication on the CPUs or controllers resident in the storage systems.
Two modes, synchronous and asynchronous, but the software operates at a much lower level.
Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC even though the I/Os to a single LUN might be issued by several nodes concurrently.
The software provides remote replication through disk-based journaling.
Journaling techniques can improve levels of reliability and robustness in remote copying operations, thereby also providing better data recovery capabilities.
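The idea behind journal-based remote copy can be sketched generically: each write is recorded with a sequence number, and the remote side applies entries strictly in order, resuming from the last applied entry after a link interruption. The classes below are hypothetical and do not describe any vendor's implementation.

```python
class Journal:
    """Primary-side journal: every write gets an increasing sequence number."""

    def __init__(self):
        self.entries = []

    def record(self, lun, data):
        seq = len(self.entries) + 1
        self.entries.append((seq, lun, data))
        return seq

    def entries_after(self, seq):
        """Entries the remote side has not applied yet."""
        return [entry for entry in self.entries if entry[0] > seq]


class RemoteCopy:
    """Remote-side copy that applies journal entries in sequence order only."""

    def __init__(self):
        self.applied_seq = 0
        self.luns = {}

    def apply(self, entries):
        for seq, lun, data in sorted(entries, key=lambda entry: entry[0]):
            if seq != self.applied_seq + 1:
                break  # gap detected: stop rather than apply writes out of order
            self.luns[lun] = data
            self.applied_seq = seq


journal = Journal()
journal.record("lun0", "v1")
journal.record("lun1", "a")
journal.record("lun0", "v2")

remote = RemoteCopy()
remote.apply(journal.entries_after(0)[:1])                 # link drops after one entry
remote.apply(journal.entries_after(remote.applied_seq))    # resume where it stopped
print("applied up to:", remote.applied_seq, "| remote state:", remote.luns)
```

Because writes are applied in their original order, the remote copy always corresponds to some past state of the primary, which is what makes it usable for recovery.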
Data replication infrastructure
Replicating databases
Relational database management system (RDBMS) portfolios from IBM and Oracle include a wide range of tools to manage and administer data held in their respective databases: DB2 and Oracle
The RDBMS software is designed to handle logical changes to the underlying data
So, it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution.
Metro Clusters
The ability to cluster systems across hundreds of kilometers using Dense Wavelength Division Multiplexers (DWDM) and SAN-connected Fibre Channel storage devices
Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster and storage between two widely separated data centers
The physically separated cluster nodes work identically but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment
Section 5
Costing
Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.
Building the business case requires a different approach that calculates the cost of downtime, defines specific requirements, identifies realistic risks, selects cost-effective technologies and services, and shows a commitment to disaster recovery planning and preparedness as an ongoing program.
SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING
Implement a continuity management process.
Conduct a business impact analysis (BIA) and risk assessment.
Calculate the cost of downtime.
Develop impact scenarios that address all risks, not just
disasters.
Position DR as a competitive necessity.
Develop a DR services catalog.
Align DR technology investments with other IT initiatives.
Item | Assumption | Qty | Unit Price (INR) | Cost (INR crores)
Capex
DC site | 33% of space in sqft | 20,000 | 25,000 | 50
Servers | 33% of CPUs | 2,000 | 500,000 | 100
Storage | 33% of storage in TB | 2,000 | 400,000 | 80
Network | 10% of server cost | | | 10
Software | 15% of storage cost | | | 12
Implementation - Consulting | 10% of Capex | | | 20
Total | | | | 272
Opex
Bandwidth | 100,000 (column unclear in source) | | | 50
Power | Rs. 50,000 per kW per annum, 6 kW per rack | 600 | 300,000 | 18
Manpower | 6 NOC seats, 20 on-site | | | 10
AMC | 6% of Capex | | | 15
Total | | | | 93
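The arithmetic behind the costing table can be reproduced as follows. This is a sketch under stated assumptions: 1 crore is 10,000,000 INR, the Power quantity of 600 is read as a number of racks, and the Consulting, Bandwidth, Manpower and AMC figures are taken from the table as given because their basis cannot be fully reconstructed. The helper names are hypothetical.

```python
CRORE = 10_000_000  # 1 crore = 10 million INR


def to_crores(qty: float, unit_price_inr: float) -> float:
    """Quantity times unit price, expressed in INR crores."""
    return qty * unit_price_inr / CRORE


def percent_of(amount: float, percent: float) -> float:
    return amount * percent / 100


# Capex (INR crores)
dc_site = to_crores(20_000, 25_000)     # sqft x INR per sqft
servers = to_crores(2_000, 500_000)
storage = to_crores(2_000, 400_000)
network = percent_of(servers, 10)       # 10% of server cost
software = percent_of(storage, 15)      # 15% of storage cost
consulting = 20                         # taken from the table as given
capex_total = dc_site + servers + storage + network + software + consulting

# Opex (INR crores)
power_per_rack_inr = 6 * 50_000         # 6 kW per rack x Rs. 50,000 per kW per annum
power = to_crores(600, power_per_rack_inr)  # 600 assumed to be racks
bandwidth, manpower, amc = 50, 10, 15   # taken from the table as given
opex_total = bandwidth + power + manpower + amc

print("Capex total (INR crores):", capex_total)
print("Opex total (INR crores):", opex_total)
```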
Thank You