Disaster Recovery BSS Data Center


  • 8/8/2019 Disaster Recovery BSS Data Center

    1/53

    Disaster Recovery for a BSS Data Center

    1DR for a BSS Data Centre


    Disaster Recovery: The Lighter Side


    Section 1

    Disaster Recovery Overview


    What is a Disaster?

    Hazard which has come to realization

    Perceived tragedy

    Natural calamity

    Man-made catastrophe

    Disasters are the consequence of inappropriately managed risks


    Risks to be Addressed


    What is Disaster Recovery from an IT Perspective?

    Timely and effective restoration of IT services in a major incident

    Any plan or set of procedures implemented by a business to maintain uptime and/or prevent data loss in the event of a system failure


    Disaster Recovery

    People: Staff, Outsourced

    Process: Crisis Management

    Technology: Hardware, Software


    IT


    Metrics for Disaster Recovery (1/2)

    Driven by two metrics

    Recovery Time Objective (RTO)

    Interrupted for how long?

    Recovery Point Objective (RPO)

    How much data loss?


    Metrics for Disaster Recovery (2/2)

    [Figure: timeline from 5 a.m. to 7 p.m. with the disaster declared at 10 a.m.; spans labelled Recovery Point Objective (RPO) and Recovery Time Objective (RTO)]

    RPO: Amount of data lost from a failure, measured as the amount of time preceding the disaster event

    RTO: Targeted amount of time to restart a business service after a disaster event
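To make the two metrics concrete, the following minimal sketch measures the achieved data-loss window and recovery time from three timestamps and compares them with their objectives. The function names, the 15-minute and 4-hour objectives, and every timestamp other than the 10 a.m. declaration are illustrative assumptions, not values from the slides.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable copy and the disaster."""
    return disaster - last_recoverable_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Outage length: time between the disaster and the service being restored."""
    return service_restored - disaster


def meets(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved value does not exceed it."""
    return achieved <= objective


# Hypothetical incident: disaster declared at 10 a.m. on an arbitrary date.
disaster = datetime(2024, 1, 1, 10, 0)
last_copy = datetime(2024, 1, 1, 9, 50)  # assumed time of the last replicated data
restored = datetime(2024, 1, 1, 15, 0)   # assumed time the service resumed

rpo = achieved_rpo(last_copy, disaster)  # 10 minutes of data lost
rto = achieved_rto(disaster, restored)   # 5 hours of interruption

print("RPO met:", meets(rpo, timedelta(minutes=15)))
print("RTO met:", meets(rto, timedelta(hours=4)))
```

In this hypothetical incident the RPO objective is met and the RTO objective is missed, which shows that the two metrics are tracked independently.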


    Understanding RPO and RTO

    Cost of downtime per hour

    Employee cost per hour + Cost of problem repair + Cost of employee overtime

    Loss of customers

    Reputation of the company
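The direct part of the downtime cost above is a plain sum, sketched below. The function names and all rupee figures are hypothetical; loss of customers and reputation are left out because the slide gives no way to quantify them.

```python
def downtime_cost_per_hour(employee_cost: float, repair_cost: float, overtime_cost: float) -> float:
    """Direct hourly cost as the sum of the three terms listed on the slide."""
    return employee_cost + repair_cost + overtime_cost


def outage_cost(hours: float, hourly_cost: float) -> float:
    """Direct cost of an outage; customer loss and reputation are not modelled."""
    return hours * hourly_cost


# Hypothetical figures in INR per hour.
hourly = downtime_cost_per_hour(employee_cost=200_000, repair_cost=50_000, overtime_cost=30_000)
print(hourly)                  # 280000
print(outage_cost(4, hourly))  # 1120000
```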

    Recovery Point Objective (RPO)

    A point in time to which the data must be recovered

    An acceptable loss of data during a disaster situation

    Recovery Time Objective (RTO)

    The duration of time within which a business process must be restored after a disaster (underlying infrastructure and application components are restored first)


    Investment Scenario


    High Availability vs. Disaster Tolerance

    High Availability: providing redundancy within a data center to maintain the service (with or without a short outage)

    Hardware failures

    Software failures

    Human error

    Disaster Tolerance: providing redundancy between data centers to restore the service quickly (tens of minutes) after certain disasters (dedicated equipment)

    Power loss

    Fire, flood, earthquake

    Sabotage, terrorism


    Availability Events (1/2)

    Planned Outages

    Network and power related changes

    Hardware repair

    Hardware and/or software upgrades

    Software maintenance: OS, Database, Applications

    Data backup and storage management

    As data grows in size, tape backup is less effective

    What data must be archived?

    How is the data archived?


    What causes the most Downtime?


    Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008


    Measure of Availability

    Hours of downtime per year per IT service

    Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
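A minimal sketch of the usual conversion from an availability percentage to hours of downtime per year, assuming a 365-day year. The percentages in the loop are illustrative and are not taken from the Gartner source.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service is unavailable at the given availability."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> {downtime_hours_per_year(pct):.2f} hours down per year")
```

Each additional "nine" cuts the yearly downtime by a factor of ten, which is why availability targets are usually discussed together with the cost of reaching them.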


    Section 2

    Architecture & Sizing for Disaster Recovery


    2-Site Architecture

    100% Primary Site + 100% DR Site

    Database changes are frequent, hence log-based replication of the database between the Primary and DR sites

    Synchronous replication is not possible because of WAN bandwidth

    Asynchronous replication is possible

    RPO -> Depends on how much data has to be replicated

    RTO -> Depends on People + Processes
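The dependence of RPO on the volume of data awaiting replication can be sketched as a back-of-the-envelope calculation. This is only an estimate under stated assumptions: the function names, the 900 MB backlog, the 8 Mbps link and the 2 MB/s change rate are hypothetical, and protocol overhead and compression are ignored.

```python
def replication_lag_minutes(backlog_megabytes: float, link_megabits_per_second: float) -> float:
    """Minutes needed to ship the data that has not yet reached the DR site."""
    seconds = backlog_megabytes * 8 / link_megabits_per_second
    return seconds / 60


def backlog_growth_mb_per_second(change_rate_mb_per_second: float,
                                 link_megabits_per_second: float) -> float:
    """Rate at which unreplicated data accumulates; zero if the link keeps up."""
    link_mb_per_second = link_megabits_per_second / 8
    return max(0.0, change_rate_mb_per_second - link_mb_per_second)


# Hypothetical: 900 MB of archive logs waiting on an 8 Mbps link.
print(replication_lag_minutes(900, 8))       # 15.0
# Hypothetical: database changing at 2 MB/s over the same link.
print(backlog_growth_mb_per_second(2.0, 8))  # 1.0
```

When the change rate exceeds what the link can carry, the backlog keeps growing and so does the effective RPO, so link sizing has to follow the peak change rate rather than the average.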


    2-Site Architecture: Working

    [Figure: Primary Site (ACTIVE) and DR Site, each with an Application Tier (application servers), a DB Tier (DBCI servers in cluster) and a Storage Tier (SAN storage with volume groups for application files and archive logs); labels: Asynchronous Replication, Dark Fiber]


    2-Site Architecture

    Advantages

    Simple to manage

    Less expensive than other solutions

    Only one link needs to be procured

    Disadvantages

    RPO of 15 minutes is not quantifiable (impact could be high or low)

    Cannot estimate what kind of data loss will happen

    RTO for the DR site cannot be quantified to the business because of lost transactions


    3 Site Architecture (for RPO=0)


    For RPO=0

    Must have synchronous replication of the database

    Synchronous replication has limitations on distance (40 to 60 km)

    Hence cannot replicate synchronously over long distances

    But can replicate over short distances

    So a 3-site (Primary, Near, DR) solution might achieve RPO=0 (almost)
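The slide states the distance limit without giving the reason. One commonly cited factor, offered here as an assumption rather than as the author's argument, is propagation delay: every synchronous write waits for a round trip to the remote site. The sketch assumes roughly 5 microseconds per kilometre one way in optical fibre; real limits also depend on the storage array, the protocol and what the application tolerates.

```python
FIBRE_DELAY_US_PER_KM = 5.0  # assumed one-way propagation delay in optical fibre


def sync_round_trip_ms(distance_km: float) -> float:
    """Round-trip propagation delay added to each synchronously replicated write."""
    return 2 * distance_km * FIBRE_DELAY_US_PER_KM / 1000.0


for km in (25, 60, 1000):
    print(f"{km} km -> {sync_round_trip_ms(km):.2f} ms per write")
```

Under this assumption a Near site a few tens of kilometres away adds well under a millisecond per write, while a DR site a thousand kilometres away adds about ten.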


    In what cases will RPO be zero?

    Regional disasters which don't destroy the Primary and Near sites at the same time

    For all kinds of DC failures, RPO=0 can be achieved

    In case of a regional disaster which wipes out both the Primary and Near sites, RPO will depend upon the link between Primary and DR (could be 15 minutes, depending upon the size of the link)
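The cases above reduce to a single decision, sketched here. The function name is hypothetical and the 15-minute default simply reuses the example figure from the slide; the real value depends on the Primary-to-DR link.

```python
def expected_rpo_minutes(near_site_survives: bool, async_lag_minutes: float = 15.0) -> float:
    """Expected data loss after losing the Primary site in a 3-site setup."""
    if near_site_survives:
        return 0.0            # the Near site holds a synchronous copy
    return async_lag_minutes  # only the asynchronously replicated DR copy remains


print(expected_rpo_minutes(near_site_survives=True))   # 0.0
print(expected_rpo_minutes(near_site_survives=False))  # 15.0
```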


    [Figure: Primary Site, Near Site and DR Site, each with application servers, DBCI servers and SAN storage with volume groups for application files and archive logs; labels: Synchronous Replication, Asynchronous Replication, WAN link]


    3-Site Architecture: Working

    [Figure: Site A and Site B connected by dark fibre, distance < 25 km; Site C DR; labels: PROD, Near/Bunker]


    3 Site DR considerations

    What must a Near site have?

    Different and multiple power sources / power grids

    Network termination exactly the same as the Primary DC (if the Near site has to be used for Primary site operations)

    Replication links from multiple vendors (no SPOF)

    Link to the DR site


    What should be in the Near Site?

    Option 1: Full 100% replica of the Primary Site

    High cost (Infrastructure + People)

    Servers, storage, firewalls, switches, backup, power sources

    Applications, databases, etc.

    Security, personnel, processes

    Network connectivity

    Would protect against any local problems at the Primary DC


    What should be in the Near Site?

    Option 2: Split configuration between the Primary and Near sites

    Database servers split between the Primary and Near sites (extended cluster)

    When the Primary DC fails, operations move to the Near Site

    Maintenance and continuous upkeep of the Near Site are essential

    Redundancy required for application servers, firewalls, routers, servers, backup, etc.


    What should be in the Near Site?

    Option 3: Minimalist

    Treat the Near site as serving only the RPO=0 purpose, not operations

    Replicate storage continuously for RPO=0

    Keep only the hardware that can push data from the Near site to DR in case of Primary DC failure

    Keeps the simplicity of 2-site DR with the RPO=0 of 3-site

    RPO=0 is not achieved if the Primary and Near sites go down together


    Section 3

    Connectivity to DR Site


    Connectivity

    The majority of businesses deploy wide area networks (WANs) to connect the remote parts of the business back to centralized resources.

    Bandwidth is always an issue in disaster recovery. If you're replicating data for potential failover both locally and remotely, then your bandwidth issues become more complicated.

    We want to establish a DR site that's far enough away that it won't be affected by the same disaster, but not so far away that WAN bandwidth costs will be prohibitive.


    The physical distance involved will often dictate the type of replication used to move data between sites.

    There are two types of replication:

    1) Synchronous replication

    2) Asynchronous replication

    Synchronous replication moves data in real time so that the data center and DR site contain the same data moment to moment, but synchronous data transfers often need high-bandwidth connections.

    Asynchronous replication moves data on a bandwidth-available basis. This allows data movement using cheaper, lower-bandwidth connections, but presents a possibility of data loss because the data center and DR site may be out of sync by up to several hours.
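The difference between the two modes can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions, not any product's behaviour: the class and method names are hypothetical, writes are counted in blocks, and `drain` stands in for whatever bandwidth the link has available.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume with one remote replica."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = deque()  # writes acknowledged locally but not yet shipped

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            self.replica.append(block)  # acknowledged only once the remote copy exists
        else:
            self.pending.append(block)  # acknowledged immediately, shipped later

    def drain(self, max_blocks: int):
        """Ship up to max_blocks queued writes, as link bandwidth allows."""
        for _ in range(min(max_blocks, len(self.pending))):
            self.replica.append(self.pending.popleft())

    def blocks_at_risk(self) -> int:
        """Writes that would be lost if the primary site failed right now."""
        return len(self.primary) - len(self.replica)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in range(5):
    sync_vol.write(block)
    async_vol.write(block)
async_vol.drain(2)  # the link had room for only two blocks so far

print(sync_vol.blocks_at_risk())   # 0
print(async_vol.blocks_at_risk())  # 3
```

The blocks still at risk in the asynchronous case are what the RPO measures when expressed as time.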


    With the popularity of IP connectivity, there are many connectivity options available. SAN connectivity can use options such as:

    Ethernet

    FC (Fibre Channel)

    iSCSI (Internet Small Computer System Interface)

    FCIP (Fibre Channel over IP)

    FCoE (Fibre Channel over Ethernet)

    The sites can be connected by a VPN, which provides cost benefits.

    1) Ethernet

    Traditional Ethernet ports support 10/100 Mbps, far slower than Fibre Channel. Ethernet bandwidth is increasing, and 10 Gigabit Ethernet (10GigE) is widely available for data centers.

    2) Fibre Channel

    Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available and 10 Gbps implementations are appearing on some high-end systems and director-class switches.


    3) iSCSI (Internet Small Computer System Interface)

    iSCSI is used to transfer data over LANs, WANs or the Internet, and supports storage management over long distances.

    iSCSI works by encapsulating SCSI commands into IP packets for transmission over an Ethernet connection, rather than a Fibre Channel connection.

    iSCSI still has two disadvantages for storage:

    At 1 GigE, it does not perform as fast as Fibre Channel.

    Ethernet will drop packets during network congestion.

    These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet.

    4) FCIP

    FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged between distant Fibre Channel SANs. It's important to note that FCIP only works to connect Fibre Channel SANs, but iSCSI can run on any Ethernet network.

    5) FCoE

    Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable SAN and LAN convergence.


    Requirements

    To establish WAN connectivity between the Central Location and 2 remote locations for a Data Transfer Application.

    The leased-line-based network design is to be used primarily for implementing the Online Data Transfer Application, with automatic ISDN backup connectivity.

    The connectivity from the Central Location to the remote locations is to run at 64 Kbps to 2 Mbps.

    The connectivity is to be always on.

    The network devices are to be SNMP managed.

    Provision for future scalability.
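The "always on" and "automatic ISDN backup" requirements amount to a simple link-selection rule, sketched below. The `Link` type, the link names and the speeds are hypothetical illustrations; in practice the routers make this decision themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    name: str
    kbps: int
    up: bool


def active_link(leased: Link, isdn_backup: Link) -> Optional[Link]:
    """Prefer the leased line; fall back to ISDN; None if both are down."""
    if leased.up:
        return leased
    if isdn_backup.up:
        return isdn_backup
    return None


# Hypothetical state: the leased line has failed, the ISDN backup is available.
leased = Link("leased-line", kbps=2000, up=False)
isdn = Link("isdn-backup", kbps=128, up=True)

chosen = active_link(leased, isdn)
print(chosen.name if chosen else "no connectivity")
```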


    DAX Network

    Central Location:

    At the Central location, Dax recommended that the customer opt for one DX-2650 Modular Access Router with 1 x 10/100 port, 4 NM slots and VoIP module support.

    The router was populated as follows:

    Slot 1 - 2-port Sync/Async Serial module (speed up to 2 Mbps)

    Slot 2 - 4-port ISDN U module

    The remaining 2 slots were left free for future scalability.

    Remote Location:

    At the Remote location, Dax recommended that each remote branch use a DX-1721 Modular Router with 1 x 10/100 port and 4 WAN slots for WAN/VoIP modules.

    Each DX-1721 was loaded with the following modules:

    Slot 1 - ISDN S/T module for providing automatic backup connectivity

    Slot 2 - 1-port high-speed Serial Sync/Async WAN interface module for connecting the leased line link at 64 Kbps up to 2 Mbps

    The remaining 2 slots were left free for future scalability.


    Section 4

    Backup Solution


    Possible Options

    Backup and recovery from tape

    Host-based replication

    Storage-based replication

    Data replication infrastructure

    Replicating databases

    A comparison of the various disaster recovery solutions

    Metro clusters


    Backup And Recovery From Tape

    RAID technology is used to provide high levels of data availability

    RAID cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted

    The tapes can be cloned, i.e., copied to new media, to allow them to be stored off-site in a disaster recovery location

    Least expensive of all the options

    It is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e. services whose RPOs tolerate data loss and for which longer RTOs are acceptable


    Host-based replication

    The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware RAID-protected LUNs

    It then forwards these writes on to one or more remote Solaris OS-based nodes connected through an IP-based network

    2 modes of data transfer: synchronous mode replication and asynchronous mode replication


    Storage-Based Data Replication

    Data replication is performed on the CPUs or controllers resident in the storage systems.

    2 modes, synchronous and asynchronous, but the software operates at a much lower level.

    Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC even though the I/Os to a single LUN might be issued by several nodes concurrently.

    The software provides remote replication through disk-based journaling.

    Journaling techniques can improve levels of reliability and robustness in remote copying operations, thereby also providing better data recovery capabilities.
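A generic illustration of why journaling helps remote copying: writes are logged in order with sequence numbers, so the remote side can apply them in the same order and resume after a link interruption without losing or reordering anything. This is a toy in-memory sketch with hypothetical class names, not the mechanism of any particular storage product.

```python
class Journal:
    """Ordered log of writes kept at the source storage system."""

    def __init__(self):
        self.entries = []  # list of (sequence_number, block_id, data)

    def append(self, block_id, data):
        self.entries.append((len(self.entries) + 1, block_id, data))

    def after(self, sequence_number):
        """Entries the remote side has not applied yet."""
        return [e for e in self.entries if e[0] > sequence_number]


class RemoteCopy:
    """Remote volume that applies journal entries strictly in order."""

    def __init__(self):
        self.blocks = {}
        self.applied = 0  # sequence number of the last applied entry

    def apply(self, entries):
        for seq, block_id, data in entries:
            if seq != self.applied + 1:
                raise ValueError(f"gap in journal: expected {self.applied + 1}, got {seq}")
            self.blocks[block_id] = data
            self.applied = seq


journal, remote = Journal(), RemoteCopy()
for block_id, data in [(7, b"a"), (9, b"b"), (7, b"c")]:
    journal.append(block_id, data)

remote.apply(journal.entries[:2])            # link drops after two entries
remote.apply(journal.after(remote.applied))  # resume from the last applied entry
print(remote.blocks)
```

Because the remote copy tracks the last applied sequence number, recovery after an interruption only needs the entries that follow it.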


    Data replication infrastructure


    Replicating databases

    The relational database management system (RDBMS) portfolios from IBM and Oracle include a wide range of tools to manage and administer data held in their respective databases: DB2 and Oracle

    The RDBMS software is designed to handle logical changes to the underlying data

    So it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution.
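A rough way to see the network-traffic claim is to compare the bytes shipped when a set of rows changes. The sketch uses deliberately simple, hypothetical assumptions: an 8 KB block size, 200 bytes per logical change record, and the worst case for block-based replication in which every changed row lives in a different block.

```python
def block_based_bytes(rows_changed: int, block_size: int = 8192) -> int:
    """Worst case: every changed row sits in a different block, shipped whole."""
    return rows_changed * block_size


def logical_bytes(rows_changed: int, change_record_size: int = 200) -> int:
    """One change record per modified row."""
    return rows_changed * change_record_size


rows = 1_000
print(block_based_bytes(rows))  # 8192000
print(logical_bytes(rows))      # 200000
```

Actual ratios depend on how rows cluster within blocks and on the replication product, so these figures show only the direction of the difference.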


    Metro Clusters

    The ability to cluster systems across hundreds of kilometers using Dense Wavelength Division Multiplexers (DWDM) and SAN-connected Fibre Channel storage devices

    Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster and storage between two widely separated data centers

    The physically separated cluster nodes work identically but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment


    Section 5

    Costing


    Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.

    Building the business case requires a different approach that calculates the cost of downtime, defines specific requirements, identifies realistic risks, selects cost-effective technologies and services, and shows a commitment to disaster recovery planning and preparedness as an ongoing program.


    SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING

    Implement a continuity management process.

    Conduct a business impact analysis (BIA) and risk assessment.

    Calculate the cost of downtime.

    Develop impact scenarios that address all risks, not just disasters.

    Position DR as a competitive necessity.

    Develop a DR services catalog.

    Align DR technology investments with other IT initiatives.


    Item                         Assumption                                   Qty      Unit Price (INR)   Cost (INR crores)

    Capex
    DC site                      33% of space in sqft                         20,000   25,000             50
    Servers                      33% of CPUs                                  2,000    500,000            100
    Storage                      33% of storage in TB                         2,000    400,000            80
    Network                      10% of server cost                                                       10
    Software                     15% of storage cost                                                      12
    Implementation - Consulting  10% of Capex                                                             20
    Total                                                                                                 272

    Opex
    Bandwidth                    100,000                                                                  50
    Power                        Rs. 50,000 per kW per annum, 6 kW per rack   600      300,000            18
    Manpower                     6 NOC seats, 20 on-site                                                  10
    AMC                          6% of Capex                                                              15
    Total                                                                                                 93
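The lines of the table that carry a quantity and a unit price can be re-derived with a few lines of arithmetic (1 crore = 10,000,000 rupees). This is a sketch under stated assumptions: the variable names are hypothetical, reading the power quantity of 600 as a rack count is an assumption, and the implementation, bandwidth, manpower and AMC lines are taken from the table as stated rather than derived.

```python
CRORE = 10_000_000  # 1 crore = 10^7 rupees


def in_crores(quantity: int, unit_price_inr: int) -> float:
    """Line cost in INR crores from quantity and unit price."""
    return quantity * unit_price_inr / CRORE


dc_site = in_crores(20_000, 25_000)  # sqft x INR per sqft
servers = in_crores(2_000, 500_000)
storage = in_crores(2_000, 400_000)
network = servers * 10 / 100         # 10% of server cost
software = storage * 15 / 100        # 15% of storage cost
power = in_crores(600, 300_000)      # assumed: 600 racks x (6 kW x Rs. 50,000)

# Lines taken from the table as stated rather than derived.
implementation, bandwidth, manpower, amc = 20, 50, 10, 15

capex = dc_site + servers + storage + network + software + implementation
opex = bandwidth + power + manpower + amc
print(capex, opex)
```

The derived lines and both totals agree with the figures in the table.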


    Thank You