Database Disaster Recovery
Agenda
Customer Challenges
NetApp Confidential - Limited Use
Let’s first understand the customer challenges.
Challenging economic times are compelling businesses to achieve
even greater levels of cost savings and operational efficiency. Yet
business-critical applications still require vital data to be
protected and available to meet increasing service-level demands.
The majority of businesses that fail to protect their critical
applications do not get a second chance, and those that fail to
reduce their operational expenses may suffer the same fate.
This chart represents the results of an online survey conducted by
Forrester Research and Disaster Recovery Journal that included
responses from 250 IT decision makers.
The key takeaways:
The overwhelming majority of declared disasters are caused by
internal data center failures. In fact, more than 75% of those
surveyed have experienced an outage inside the data center. These
common causes of downtime include power, hardware, and network
failures. By contrast, the more catastrophic disasters that affect
the entire data center, such as floods, hurricanes, and fires, are
rare occurrences.
Data Availability & Data Protection solution for Oracle
Automates the creation and maintenance of one or more synchronized
copies of the primary database
If the primary database becomes unavailable, a standby database can
easily assume the primary role
Standby databases can be used for queries, reporting, testing, or
backups while in standby role
Feature of Oracle Database Enterprise Edition (EE)
What is Data Guard?
Oracle Data Guard Architecture
[Diagram: Data Guard architecture; recoverable label: Online Redo Logs]
On the primary database, Data Guard uses a specialized background
process, called LogWriter Network Server (LNS), to capture redo
data being written by Log Writer and synchronously or
asynchronously transmit the redo data to the standby database. The
LNS process isolates Log Writer from the overhead of transmission
and from network disruptions.
Synchronous redo transport (SYNC) requires the Log Writer on the
primary database to wait for acknowledgement from LNS that the
standby has received the redo data and written it to a standby redo
log before it can acknowledge the commit to the client application.
This ensures that all committed transactions are on disk and
protected at the standby location.
Asynchronous redo transport (ASYNC) does not require the Log Writer
on the primary to wait for acknowledgment from the standby database
that redo has been written to disk; commits are acknowledged to the
client application independently of redo transport.
In addition, ASYNC very efficiently streams redo to the standby
database to eliminate any overhead caused by network acknowledgment
that would otherwise reduce network throughput.
In cases where the primary and standby databases become
disconnected (network failures or standby server failures), and
depending upon the protection mode chosen (protection modes are
discussed later in this paper), the primary database will continue
to process transactions and accumulate a backlog of redo data that
cannot be shipped to the standby until a new network connection can
be established (referred to as an archive log gap). While in this
state, Data Guard continually monitors standby database status,
detects when connection is re-established, and automatically
resynchronizes the standby database with the primary to return the
configuration to a protected state as fast as possible.
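The difference between SYNC and ASYNC transport, and the backlog that builds up during a disconnect, can be illustrated with a small toy model. This is only a sketch: the `Primary` and `Standby` classes and their methods are hypothetical names invented for illustration and are not part of any Oracle interface. In SYNC mode the sketch simply lets a connection error propagate; how a real primary reacts depends on the protection mode in effect.

```python
from collections import deque


class Standby:
    """Toy standby that records the redo it has received (hypothetical)."""

    def __init__(self):
        self.received = []
        self.reachable = True

    def receive(self, record):
        if not self.reachable:
            raise ConnectionError("standby unreachable")
        self.received.append(record)


class Primary:
    """Toy primary that ships redo in SYNC or ASYNC mode (hypothetical)."""

    def __init__(self, standby, mode):
        self.standby = standby
        self.mode = mode          # "SYNC" or "ASYNC"
        self.local_log = []       # redo written locally
        self.backlog = deque()    # redo not yet shipped to the standby

    def commit(self, record):
        self.local_log.append(record)
        if self.mode == "SYNC":
            # The commit returns only after the standby has the redo.
            self.standby.receive(record)
        else:
            # The commit returns immediately; shipping happens later.
            self.backlog.append(record)
        return "committed"

    def ship_backlog(self):
        # Ship queued redo in order and stop at the first failure.
        while self.backlog:
            try:
                self.standby.receive(self.backlog[0])
            except ConnectionError:
                break
            self.backlog.popleft()
        return len(self.backlog)  # size of the remaining gap
```

Calling `ship_backlog()` again after the standby becomes reachable drains the queue, which loosely mirrors the automatic resynchronization described above.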
© 2008 NetApp. All rights reserved.
Data Guard Redo Apply
Physical Standby Database is a block-for-block copy of the primary
database
Uses the database recovery functionality to apply changes
Can be opened in read-only mode while apply is active for
reporting/queries
Can also be used for backups, offloading production database
Data Guard SQL Apply
Contains the same logical information (rows) as the production
database
Physical organization and structure can be very different
Can host multiple schemas
Can be queried for reports while logs are being applied via
SQL
Can create additional indexes and materialized views for better
query performance
Maintains a physical, block-for-block copy of the primary
Can be open for read-only queries
At role transition, offers the assurance that the standby database
chosen to be the new primary has not been changed compared to the
old primary
Can be used for backups
Faster, since it bypasses the SQL transformation layer
Maintains a logical, transaction-for-transaction copy of the
primary
Allows creation of additional objects, modification of
objects
Possible to skip apply on certain objects
Can be used as a good reporting solution – supports real-time
reporting in 10g
Has datatype restrictions
Flexible Data Protection Modes
Protection Mode: Redo Transport
Maximum Protection: Synchronous redo shipping
Maximum Availability: Synchronous redo shipping
Maximum Performance: Asynchronous redo shipping
Switchover and Failover
Switchover
Failover
Use Flashback Database to reinstate original primary
Manually execute via simple SQL / GUI interface, or
Automate failover using Data Guard Fast-Start Failover
An Oracle database operates in one of two roles: primary or
standby. Data Guard helps you change the role of a database using
either a switchover or a failover:
A switchover is a role reversal between the primary database and
one of its standby databases. A switchover guarantees no data loss
and is typically done for planned maintenance of the primary
system. During a switchover, the primary database transitions to a
standby role, and the standby database transitions to the primary
role.
A failover is done when the primary database (all instances of an
Oracle RAC primary database) fails or has become unreachable and
one of the standby databases is transitioned to take over the
primary role. Failover should be performed when the primary
database cannot be recovered in a timely manner. Failover may or
may not result in data loss depending on the protection mode in
effect at the time of the failover.
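The two role transitions can be sketched as operations on a simple role map. This is an illustrative model only: the function names, the database names, and the "failed" label for the old primary are assumptions made for this example and do not correspond to Data Guard commands.

```python
def switchover(roles, new_primary):
    """Planned role reversal: the old primary becomes a standby."""
    if roles.get(new_primary) != "standby":
        raise ValueError("target must currently be a standby")
    old_primary = next(n for n, r in roles.items() if r == "primary")
    updated = dict(roles)
    updated[old_primary] = "standby"
    updated[new_primary] = "primary"
    return updated


def failover(roles, new_primary):
    """Unplanned transition: the old primary is marked failed."""
    if roles.get(new_primary) != "standby":
        raise ValueError("target must currently be a standby")
    # "failed" is a label invented for this sketch; the old primary
    # stays out of the configuration until it is reinstated.
    updated = {n: ("failed" if r == "primary" else r)
               for n, r in roles.items()}
    updated[new_primary] = "primary"
    return updated
```

For example, `switchover({"db_a": "primary", "db_b": "standby"}, "db_b")` swaps the two roles, while `failover` with the same arguments leaves `db_a` marked as failed.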
Without the broker, you perform role transitions by first
determining if a role transition is necessary and then issuing a
series of SQL statements (as described in Oracle Data Guard
Concepts and Administration). The broker simplifies switchovers and
failovers by allowing you to invoke them using a single key click
in Oracle Enterprise Manager or a single command in the DGMGRL
command-line interface (referred to in this documentation as manual
failover). Moreover, you can enable fast-start failover to fail
over automatically when the conditions for fast-start failover are
met. When fast-start failover is enabled, the broker determines if
a failover is necessary and initiates the failover to the specified
target standby database automatically, with no need for DBA
intervention.
Fast-start failover allows you to increase availability with less
need for manual intervention, thereby reducing management costs.
Manual failover gives you control over exactly when a failover
occurs and to which target standby database. Regardless of the
method you choose, the broker coordinates the role transition on
all databases in the configuration.
When the database is opened for the first time after a role
transition, the DB_ROLE_CHANGE system event fires. You can write a
trigger that's associated with this system event to manage tasks
after a role change occurs. See the table of system manager events
in Oracle Database Advanced Application Developer's Guide for more
details. After a failover, the broker posts the DB_DOWN HA event in
addition to firing the DB_ROLE_CHANGE system event. You may use
both the DB_DOWN HA event and the DB_ROLE_CHANGE system event to,
for example, help user applications locate services on the new
primary database.
Data Guard Management Interfaces
SQL*Plus Command Line
Data Guard Broker
DGMGRL Command Line
DMON process running on all databases in a Data Guard config
Simpler, single statement commands that perform the work of
multiple SQL*Plus commands
Attach to any member of a Data Guard configuration and manage all
members as a single configuration
Enterprise Manager Grid Control
No separate license purchase required
Snapshot Standby
Increase ROI
Preserves zero data loss – continuous redo transport while open
read-write
Truly leverages standby database and DR hardware for multiple
purposes
Similar to storage snapshots, but provides DR at the same time and
uses
single copy of storage
A Data Guard 10.2 physical standby can be open read-write (for
testing / cloning), but it can’t continue to receive redo from the
primary
A physical standby thus opened is a regular read/write database
with no connection to the primary
It can be converted back to be a physical standby
After conversion back to a physical standby, redo accumulated on
the primary has to be sent across the network to re-sync
Data can be lost if a failure occurs before re-sync is
complete
Data Guard 11g Snapshot Standby solves this problem
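The behavior described on this slide can be modeled as a standby that keeps receiving redo while open read-write and applies it only after being converted back. The sketch below is a toy model, not Oracle code: the `SnapshotStandby` class and its method names are hypothetical, and the assumption that local changes are discarded on conversion back reflects how the feature is generally understood rather than anything stated in the slides.

```python
class SnapshotStandby:
    """Toy model of a standby that can be opened read-write."""

    def __init__(self):
        self.data = {}            # state as of the applied redo
        self.pending_redo = []    # redo received but not yet applied
        self.snapshot = None      # saved state while open read-write

    def receive_redo(self, key, value):
        if self.snapshot is None:
            # Acting as a physical standby: apply redo right away.
            self.data[key] = value
        else:
            # Open read-write: keep receiving redo, but do not apply it.
            self.pending_redo.append((key, value))

    def convert_to_snapshot(self):
        # Remember the current state so it can be restored later.
        self.snapshot = dict(self.data)

    def local_write(self, key, value):
        if self.snapshot is None:
            raise RuntimeError("standby is not open read-write")
        self.data[key] = value

    def convert_to_physical(self):
        # Assumption: local test changes are thrown away, then the
        # redo that accumulated locally is applied in order.
        self.data = self.snapshot
        self.snapshot = None
        for key, value in self.pending_redo:
            self.data[key] = value
        self.pending_redo = []
```

Because the redo is already on the standby when it is converted back, nothing has to be re-sent across the network, which is the contrast with the 10.2 behavior in the bullets above.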
Applications, backups, reports run on production only
Traditional Physical Standby Databases
Simple, fast, supports all data types and applications
In Data Guard 10g
Can be open read-only, but Redo Apply has to stop
Latest data is not available for query or reports
Also prolongs switchover / failover
Oracle Active Data Guard 11g – a new Database Option
Real-time Query enables read-only access to a physical standby
database while Redo Apply is active
Offload read-only queries to physical standby
Offload fast incremental backups to physical standby
Active Data Guard 11g
Increase ROI - Real-time Query
Supports RAC on primary and/or standby
Queries see transactionally consistent results
Handles all data types, but not as flexible as logical
standby
In read-only standby, cannot write to TEMP tables in the current
release
If auditing is required, must audit to O/S
For applications that need to write result tables, login info,
etc., create a stub database and link from it to the standby.
RMAN block change tracking on standby database
Fast incremental backups complete 20x faster
Active Data Guard Benefits
Simultaneous read & recovery
Complex replication used to create reporting replica
Performance protection
So now you can see how Oracle Database 11g with Active Data Guard
provides new ways to enhance the Quality of Service for mission
critical applications.
Only Active Data Guard can continue to apply primary database
changes to an identical replica that can simultaneously be used
for real-time query or reporting, using a process that is both
simple to implement and high performing. This will make standby
databases a more integral part of data centers, driven as much by
the desire to enhance the quality of service of the production
database as by concerns for data protection and availability.
Simplicity, high performance, and reliability are key
characteristics of Active Data Guard that differentiate it from
traditional replication methods.
Microsoft High Availability
Failover clustering
Failover clustering provides high-availability support for an
entire instance of SQL Server. A failover cluster is a combination
of one or more nodes, or servers, with two or more shared disks.
Applications such as SQL Server and Notification Services are each
installed into a Microsoft Cluster Service (MSCS) cluster group,
known as a resource group.
A failover cluster does not protect against disk failure. You can
use failover clustering to reduce system downtime and provide
higher application availability. Failover clustering is supported
in SQL Server 2005 Enterprise Edition, Developer Edition and, with
some restrictions, Standard Edition. For more information about
failover clustering, see Failover Clustering and Installing a
Failover Cluster.
Database mirroring
Database mirroring is primarily a software solution to increase
database availability by supporting almost instantaneous failover.
Database mirroring can be used to maintain a single standby
database, or mirror database, for a corresponding production
database that is referred to as the principal database.
Each database mirroring configuration involves a principal server
that contains the principal database, and a mirror server that
contains the mirror database. The mirror server continuously brings
the mirror database up to date with the principal database.
Log shipping
Like database mirroring, log shipping operates at the database
level. Log shipping can be used to maintain one or more warm
standby databases, referred to as secondary databases, for a
corresponding production database that is referred to as the
primary database. Each secondary database is created by restoring a
database backup of the primary database with no recovery, or with
standby. Restoring with standby permits the resulting secondary
database to be used for limited reporting purposes.
Before a failover can occur, a secondary database must be brought
fully up-to-date by manually applying any unrestored log
backups.
Log shipping provides the flexibility of supporting multiple
standby databases. If you require multiple standby databases, you
can use log shipping alone or as a supplement to database
mirroring. When these solutions are used together, the current
principal database of the database mirroring configuration is also
the current primary database of the log shipping
configuration.
Replication
Replication uses a publish-subscribe model, allowing a primary
server, referred to as the Publisher, to distribute data to one or
more secondary servers, or Subscribers. Replication allows
real-time availability and scalability across these servers. It
supports filtering to provide a subset of data at Subscribers, and
also allows partitioned updates. Subscribers are online and
available for reporting or other functions, without query recovery.
SQL Server offers three types of replication: snapshot,
transactional, and merge. Transactional replication provides the
lowest latency and is most commonly used for high availability. For
more information, see Improving Scalability and Availability.
Microsoft SQL Server Database Mirroring
Benefits
Operating modes
How Database Mirroring Works
The principal and mirror servers communicate and cooperate as
partners in a database mirroring session. The two partners perform
complementary roles in the session: the principal role and the
mirror role. At any given time, one partner performs the principal
role, and the other partner performs the mirror role. Each partner
is described as owning its current role. The partner that owns the
principal role is known as the principal server, and its copy of
the database is the current principal database. The partner that
owns the mirror role is known as the mirror server, and its copy of
the database is the current mirror database. When database
mirroring is deployed in a production environment, the principal
database is the production database.
Database mirroring involves redoing every insert, update, and
delete operation that occurs on the principal database onto the
mirror database as quickly as possible. Redoing is accomplished by
sending a stream of active transaction log records to the mirror
server, which applies log records to the mirror database, in
sequence, as quickly as possible. Unlike replication, which works
at the logical level, database mirroring works at the level of the
physical log record. Beginning in SQL Server 2008, the principal
server compresses the stream of transaction log records before
sending it to the mirror server. This log compression occurs in all
mirroring sessions.
Operating Modes
A database mirroring session runs with either synchronous or
asynchronous operation. Under asynchronous operation, the
transactions commit without waiting for the mirror server to write
the log to disk, which maximizes performance. Under synchronous
operation, a transaction is committed on both partners, but at the
cost of increased transaction latency.
There are two mirroring operating modes. The first, high-safety
mode, supports synchronous operation. Under high-safety mode, when a
session starts, the mirror server synchronizes the mirror database
with the principal database as quickly as possible. As soon as the
databases are synchronized, a transaction is committed on both
partners, at the cost of increased transaction latency.
The second operating mode, high-performance mode, runs
asynchronously. The mirror server tries to keep up with the log
records sent by the principal server. The mirror database might lag
somewhat behind the principal database. Typically, the gap between
the databases is small. However, the gap can become significant if
the principal server is under a heavy workload or the mirror
server's system is overloaded.
SQL Server Database Mirroring
High-safety mode with automatic failover requires a third server
instance, known as a witness. Unlike the two partners, the witness
does not serve the database. The witness supports automatic
failover by verifying whether the principal server is up and
functioning. The mirror server initiates automatic failover only if
the mirror and the witness remain connected to each other after
both have been disconnected from the principal server.
The following illustration shows a configuration that includes a
witness.
SQL Server Database Mirroring
In high-performance mode, as soon as the principal server sends a
log record to the mirror server, the principal server sends a
confirmation to the client. It does not wait for an acknowledgement
from the mirror server. This means that transactions commit without
waiting for the mirror server to write the log to disk. Such
asynchronous operation enables the principal server to run with
minimum transaction latency, at the potential risk of some data
loss.
All database mirroring sessions support only one principal server
and one mirror server. This configuration is shown in the following
illustration.
SQL Server Database Mirroring
Full (Synchronous)
Off (Asynchronous)
Role Switching
Automatic failover
Manual failover
Forced service
Transaction Safety and Operating Modes
Whether an operating mode is asynchronous or synchronous depends on
the transaction safety setting. If you exclusively use SQL Server
Management Studio to configure database mirroring, transaction
safety settings are configured automatically when you select the
operation mode.
If you use Transact-SQL to configure database mirroring, you must
understand how to set transaction safety. Transaction safety is
controlled by the SAFETY property of the ALTER DATABASE statement.
On a database that is being mirrored, SAFETY is either FULL or
OFF.
If the SAFETY option is set to FULL, database mirroring operation
is synchronous, after the initial synchronizing phase. If a witness
is set in high-safety mode, the session supports automatic
failover.
If the SAFETY option is set to OFF, database mirroring operation is
asynchronous. The session runs in high-performance mode, and the
WITNESS option should also be OFF.
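The relationship between the SAFETY setting, the witness, and the resulting operating mode can be written down as a small lookup. This is a sketch for illustration: the function name `operating_mode` and the returned strings are hypothetical and are not part of Transact-SQL or any SQL Server API.

```python
def operating_mode(safety, witness_set):
    """Map a SAFETY setting and witness presence to an operating mode."""
    safety = safety.upper()
    if safety == "FULL":
        # Synchronous operation after the initial synchronizing phase.
        if witness_set:
            return "high-safety with automatic failover"
        return "high-safety without automatic failover"
    if safety == "OFF":
        # Asynchronous operation; a witness is not expected here.
        return "high-performance"
    raise ValueError("SAFETY must be FULL or OFF")
```

For instance, `operating_mode("FULL", True)` returns the mode that supports automatic failover, while `operating_mode("OFF", False)` returns high-performance mode.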
Role Switching
Within the context of a database mirroring session, the principal
and mirror roles are typically interchangeable in a process known
as role switching. Role switching involves transferring the
principal role to the mirror server. In role switching, the mirror
server acts as the failover partner for the principal server. When
a role switch occurs, the mirror server takes over the principal
role and brings its copy of the database online as the new
principal database. The former principal server, if available,
assumes the mirror role, and its database becomes the new mirror
database. Potentially, the roles can switch back and forth
repeatedly.
The following three forms of role switching exist.
Automatic failover
This requires high-safety mode and the presence of the mirror
server and a witness. The database must already be synchronized,
and the witness must be connected to the mirror server.
The role of the witness is to verify whether a given partner server
is up and functioning. If the mirror server loses its connection to
the principal server but the witness is still connected to the
principal server, the mirror server does not initiate a
failover.
Manual failover
This requires high-safety mode. The partners must be connected to
each other, and the database must already be synchronized.
Forced service (with possible data loss)
Under high-performance mode and high-safety mode without automatic
failover, forcing service is possible if the principal server has
failed and the mirror server is available.
Important: High-performance mode is intended to run without a
witness. If a witness exists, however, forcing service requires
that the witness be connected to the mirror server.
In any role-switching scenario, as soon as the new principal
database comes online, the client applications can recover quickly
by reconnecting to the database.
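The conditions for automatic failover described above can be condensed into a single predicate evaluated from the mirror server's point of view. This is a simplified sketch: the function and parameter names are hypothetical, and a real mirroring session tracks more state than these five flags.

```python
def mirror_should_fail_over(high_safety, synchronized,
                            mirror_sees_principal,
                            mirror_sees_witness,
                            witness_sees_principal):
    """Decide whether the mirror may initiate automatic failover."""
    # Automatic failover needs high-safety mode and a synchronized database.
    if not (high_safety and synchronized):
        return False
    # No failover while the mirror can still reach the principal.
    if mirror_sees_principal:
        return False
    # The mirror and the witness must remain connected to each other.
    if not mirror_sees_witness:
        return False
    # If the witness still sees the principal, the principal is up.
    return not witness_sees_principal
```

The last check captures the witness's role: a mirror that is merely cut off from the principal does not take over while the witness can still reach it.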
Disaster recovery solution for IBM’s DB2
Automates the creation and maintenance of a synchronized copy of
the primary database
If the primary database becomes unavailable, a standby database can
easily assume the primary role
Client automatic failover
What is HADR?
DB2 High Availability Disaster Recovery (HADR) is a database
replication feature that provides a high availability solution for
both partial and complete site failures. A database server can fail
from any number of factors: environmental (power/temperature),
hardware, network connectivity, software, or human intervention.
Recovery from the loss of a database server in a standard
installation can require a machine reboot and database crash
recovery, which interrupt the database services. By using DB2 HADR,
you can safely minimize the downtime to only a few seconds. HADR
protects against data loss by continually replicating data changes
from a source database, called the primary, to a target database,
called the standby. Furthermore, you can seamlessly redirect
clients that were using the original primary database to the
standby database (which becomes the new primary database) by using
Automatic Client Reroute (ACR) and retry logic in the application.
This seamless redirection is also possible using software solutions
discussed in other chapters, which can manage IP address
redirection.
IBM DB2 HADR
Overview
HADR transmits the log records from the primary database server to
the standby server. The HADR standby replays all the log records to
its copy of the database, keeping it synchronized with the primary
database server. The standby server is in a continuous rollforward
mode and is always in a state of near-readiness, so the takeover to
the standby is extremely fast. Applications can only access the
primary database and have no access to the standby database. Using
two dedicated TCP/IP communication ports and a heartbeat, the
primary and the standby keep track of where they are processing
currently and the current state of replication; perhaps most
importantly, they track whether the standby database is up to date
with the primary (known as HADR Peer state). When a log buffer is
being written to disk on the primary, it is sent to HADR to be
routed to the standby at the same time.
As mentioned, HADR communication between the primary and the
standby is through TCP/IP, which enables the database to be
replicated to a remote geographical site. This allows the database
to be recovered either locally or at the remote site in case of a
disaster at the primary database server site. The HADR solution
thus provides both High Availability and Disaster Recoverability.
As you can appreciate, HADR provides an incomparable improvement on
conventional methods for DB2 disaster recovery, which otherwise
could mean losses in terms of hours of committed transaction data.
If the primary system is not available, a takeover HADR by force
operation converts the standby system to be the new primary system.
After the maintenance or repair is done in the old primary system,
you can bring the original primary DB2 up and return both DB2
servers to their primary and
HADR Topology
With HADR, you can choose the level of protection you want from
potential loss of data by specifying one of the three
synchronization modes:
Synchronous:
Log write is considered successful only when the log buffers have
been written to the primary’s log files and after an
acknowledgement has been received that it has also been written to
the standby’s log files. There can be no data loss in this mode if
HADR is in Peer state.
Near-synchronous:
This is the default option. Log write is successful only when the
primary’s log buffer has been written to log files on the primary
and an acknowledgement is received that the log buffer has been
received on the standby.
Asynchronous:
Log write is successful when logs have been written to the disk on
the primary and log data has been sent through TCP/IP to the
standby. Data loss can occur in this mode.
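The three synchronization modes differ only in which event on the standby side must have happened before a log write counts as successful. The sketch below encodes that rule; the `SyncMode` enum, the function name, and the boolean flags are hypothetical names chosen for this example, not DB2 configuration parameters.

```python
from enum import Enum


class SyncMode(Enum):
    SYNC = "synchronous"
    NEARSYNC = "near-synchronous"
    ASYNC = "asynchronous"


def log_write_successful(mode, written_on_primary, sent_to_standby,
                         received_on_standby, written_on_standby):
    """Return True when a log write counts as successful in this mode."""
    # Every mode requires the log to be on disk at the primary first.
    if not written_on_primary:
        return False
    if mode is SyncMode.SYNC:
        # The standby must have written the log to its own log files.
        return written_on_standby
    if mode is SyncMode.NEARSYNC:
        # The standby must have acknowledged receiving the log buffer.
        return received_on_standby
    # ASYNC: it is enough that the data was sent over TCP/IP.
    return sent_to_standby
```

Reading the function from top to bottom shows the trade-off: each step down waits for less on the standby side, which lowers commit latency and raises the potential for data loss.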
HADR Topology
Synchronization modes
SnapMirror
Simple, flexible and cost-effective
Operate with FC or IP network
Mirror to/from any NetApp system
Multi-hop, cascading
Backup data can be made writeable
LAN
On the top is a production environment: servers connected via a LAN
to FAS devices. Below is the disaster recovery site, a duplicate of
the original, with SnapMirror technology in between. (click) The
value of this SnapMirror architecture is that it’s simple, flexible
and cost-effective. (click)
First, it’s simple. It’s very easy to configure. It takes minutes
to set up instead of the days or weeks needed for alternative
mirroring solutions. Furthermore, it can run across a standard IP
network, which simplifies network configuration. (click)
Secondly, it’s flexible and can run on all NetApp platforms.
Because NetApp devices can replicate to any other NetApp device,
customers can, for example, replicate over Fibre Channel or IP
without protocol converters. Customers also have the flexibility to
choose the frequency of replication. They can also replicate
multiple systems into a single system, a single system into
multiple systems, or cascade replications where one NetApp device
replicates to another which in turn replicates to a third. (click)
Finally, SnapMirror is very cost effective. One license allows
customers to implement SnapMirror in any mode, which means they can
configure anything from asynchronous replication all the way up to
synchronous replication. SnapMirror doesn’t replicate the entire
data set but only the block-level incremental changes. This
minimizes both the required network bandwidth and the cost. NetApp
provides read-only access to replicated data so remote users can
access it locally. The data on the target side can be leveraged for
other uses such as application testing or data mining. Being able
to replicate between any NetApp devices also provides the
flexibility to deploy low-cost systems as disaster recovery
targets. This reduces, even further, the investment needed to
deploy disaster recovery.
So let’s walk through some of the key features and benefits.
SnapMirror Flexibility
Synchronous SnapMirror
Here are the three: synchronous, semi-synchronous and asynchronous.
(click) With synchronous mirroring, data at the disaster recovery
site exactly matches data at the primary site. This is achieved by
replicating every data write to the remote location and not
acknowledging to the host that the write has occurred until it’s
confirmed by the remote system. (click)
This solution provides the least data loss. However, because the
host application must wait for an acknowledgement from the remote
NetApp devices, it’s limited to a distance of 50-100km before
latency becomes too great. (click)
Semi-synchronous SnapMirror allows you to achieve a near-zero
data loss disaster recovery solution without the performance
impact on the host application. (click) This means you can do
synchronous-type replication over longer distances. When data is
written to the primary storage, an acknowledgement is immediately
sent back, eliminating the latency impact on the host. In the
background, SnapMirror tries to maintain as close to synchronous
communication as possible with the remote system and leverages
user-defined thresholds on the discrepancy between the source and
remote copy data sets. (click)
Finally, asynchronous SnapMirror replicates data at adjustable
frequencies. (click) The benefit here is you can do point-in-time
replication as frequently as every minute, or as infrequently as
every few days. There’s no distance limitation so it’s often used
to replicate over long distances to protect against regional
disasters. And since only the blocks that have changed between each
replication are sent, it minimizes network usage.
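The idea of sending only the blocks that changed since the last transfer can be sketched with two dictionaries that map block numbers to block contents. This is a toy illustration of incremental replication in general, not of SnapMirror internals: the function names are hypothetical, and the sketch ignores blocks that were freed on the source.

```python
def changed_blocks(previous, current):
    """Return the blocks added or modified since the last transfer."""
    return {number: data for number, data in current.items()
            if previous.get(number) != data}


def replicate(previous, current, target):
    """Apply only the changed blocks to the target; return how many."""
    delta = changed_blocks(previous, current)
    target.update(delta)
    return len(delta)
```

With a previous point-in-time image of `{0: b"a", 1: b"b"}` and a current image of `{0: b"a", 1: b"B", 2: b"c"}`, only blocks 1 and 2 cross the network.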
There are also a number of SnapMirror deployment options.
SnapMirror® Deployment Options
With SnapMirror we can configure multiple hop configurations. The
common configuration consists of synchronous replication between
two devices within a metro area, adding additional physical
distance between servers for better, broader disaster recovery
protection. (click)
We can also add in a longer hop with an asynchronous connection
cascading out of the target. We can do many-to-one replications or
one-to-many replications. (click)
With a NetApp V-series device, we can actually replicate a
competitor’s storage array to a NetApp device. This extends the
benefits of SnapMirror to heterogeneous environments.
In addition to general replication, customers have the ability to
leverage the data for other uses.
NetApp Solution:
Primary Data Center
Replication
The animations are the same, but there are a few differences to
note. Here on the left you see a Windows application such as
Exchange or SQL Server. (click)
In this solution, SnapMirror integrates with the Windows
application through SnapManager and SnapDrive. (click)
Once SnapManager creates a snapshot in the Windows application, it
begins to replicate the data to a remote site. The replication
frequency established within SnapManager for Snapshots is the same
used by SnapMirror for replication frequency. (click)
In this environment you can actually set up different mirroring
relationships, so the database can be replicated at a chosen
frequency, such as once or twice a day, depending on the available
bandwidth. Logs can be replicated much more often, around every
fifteen minutes and sometimes even synchronously.
Application Mirroring Vs. SnapMirror