Database Disaster Recovery

Agenda
Customer Challenges
NetApp Confidential - Limited Use
Let’s first understand the customer challenges.
Challenging economic times are compelling businesses to achieve even greater levels of cost savings and operational efficiency. Yet business-critical applications still require vital data to be protected and available to meet increasing service-level demands. The majority of businesses that fail to protect their critical applications do not get a second chance, and those that fail to reduce their operational expenses may suffer the same fate.
This chart represents the results of an online survey conducted by Forrester Research and Disaster Recovery Journal that included responses from 250 IT decision makers.
The key takeaways:
The overwhelming majority of declared disasters are caused by internal data center failures. In fact, more than 75% of those surveyed have experienced an outage inside the data center. Common causes of this downtime include power, hardware, and network failures. By contrast, the more catastrophic disasters that affect the entire data center, such as floods, hurricanes, and fires, are rare.
Data Availability & Data Protection solution for Oracle
Automates the creation and maintenance of one or more synchronized copies of the primary database
If the primary database becomes unavailable, a standby database can easily assume the primary role
Standby databases can be used for queries, reporting, testing, or backups while in standby role
Feature of Oracle Database Enterprise Edition (EE)
What is Data Guard?
Oracle Data Guard Architecture
[Diagram label: Online Redo Logs]
On the primary database, Data Guard uses a specialized background process, called LogWriter Network Server (LNS), to capture redo data being written by Log Writer and synchronously or asynchronously transmit the redo data to the standby database. The LNS process isolates Log Writer from the overhead of transmission and from network disruptions.
Synchronous redo transport (SYNC) requires the Log Writer on the primary database to wait for acknowledgement from LNS that the standby has received the redo data and written it to a standby redo log before it can acknowledge the commit to the client application. This ensures all committed transactions are on disk and protected at the standby location.
Asynchronous redo transport (ASYNC) does not require the Log Writer on the primary to wait for acknowledgment from the standby database that redo has been written to disk; commits are acknowledged to the client application independently of redo transport.
In addition, ASYNC very efficiently streams redo to the standby database to eliminate any overhead caused by network acknowledgment that would otherwise reduce network throughput.
In cases where the primary and standby databases become disconnected (network failures or standby server failures), and depending upon the protection mode chosen (protection modes are discussed later in this paper), the primary database will continue to process transactions and accumulate a backlog of redo data that cannot be shipped to the standby until a new network connection can be established (referred to as an archive log gap). While in this state, Data Guard continually monitors standby database status, detects when connection is re-established, and automatically resynchronizes the standby database with the primary to return the configuration to a protected state as fast as possible.
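The difference between SYNC and ASYNC acknowledgement, and the way a redo gap is resolved after a disconnect, can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class and method names (Primary, Standby, ship_pending, commit) are hypothetical, redo records are plain strings, and protection modes are not modelled, so a SYNC commit with an unreachable standby is simply reported as not acknowledged.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    reachable: bool = True
    # Redo records written to the standby redo log (hypothetical representation).
    received: list = field(default_factory=list)


@dataclass
class Primary:
    standby: Standby
    sync: bool                                    # True = SYNC, False = ASYNC
    redo_log: list = field(default_factory=list)  # redo written locally
    shipped: int = 0                              # records already shipped

    def ship_pending(self):
        """Ship redo the standby has not received yet; this also closes a gap."""
        if not self.standby.reachable:
            return False
        self.standby.received.extend(self.redo_log[self.shipped:])
        self.shipped = len(self.redo_log)
        return True

    def commit(self, record):
        """Return True once the commit is acknowledged to the client."""
        self.redo_log.append(record)
        if self.sync:
            # SYNC: acknowledge only after the standby confirms receipt.
            return self.ship_pending()
        # ASYNC: acknowledge immediately; shipping is attempted independently.
        self.ship_pending()
        return True
```

In this model an ASYNC primary keeps acknowledging commits while the standby is unreachable, the backlog is the difference between len(redo_log) and shipped, and a later call to ship_pending() resynchronizes the standby.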
© 2008 NetApp. All rights reserved.
Data Guard Redo Apply
Physical Standby Database is a block-for-block copy of the primary database
Uses the database recovery functionality to apply changes
Can be opened in read-only mode while apply is active for reporting/queries
Can also be used for backups, offloading production database
[Diagram label: Primary Database]
Data Guard SQL Apply
Contains the same logical information (rows) as the production database
Physical organization and structure can be very different
Can host multiple schemas
Can be queried for reports while logs are being applied via SQL
Can create additional indexes and materialized views for better query performance
[Diagram label, truncated in the source: Additional Indexes &]
Physical standby (Redo Apply)
Maintains a physical, block-for-block copy of the primary
Can be open for read-only queries
At role transition, offers the assurance that the standby database chosen to be the new primary has not been changed compared to the old primary
Can be used for backups
Faster, since it bypasses the SQL transformation layer
Logical standby (SQL Apply)
Maintains a logical, transaction-for-transaction copy of the primary
Allows creation of additional objects, modification of objects
Possible to skip apply on certain objects
Can be used as a good reporting solution – supports real-time reporting in 10g
Has datatype restrictions
Flexible Data Protection Modes
Protection Mode: Redo Transport
Maximum Protection: Synchronous redo shipping
Maximum Availability: Synchronous redo shipping
Maximum Performance: Asynchronous redo shipping
Switchover and Failover
Switchover
Failover
Use Flashback Database to reinstate original primary
Manually execute via simple SQL / GUI interface, or
Automate failover using Data Guard Fast-Start Failover
An Oracle database operates in one of two roles: primary or standby. Data Guard helps
you change the role of a database using either a switchover or a failover:
A switchover is a role reversal between the primary database and one of its
standby databases. A switchover guarantees no data loss and is typically done for
planned maintenance of the primary system. During a switchover, the primary
database transitions to a standby role, and the standby database transitions to the
primary role.
A failover is done when the primary database (all instances of an Oracle RAC
primary database) fails or has become unreachable and one of the standby
databases is transitioned to take over the primary role. Failover should be
performed when the primary database cannot be recovered in a timely manner.
Failover may or may not result in data loss depending on the protection mode in
effect at the time of the failover.
Without the broker, you perform role transitions by first determining if a role
transition is necessary and then issuing a series of SQL statements (as described in
Oracle Data Guard Concepts and Administration). The broker simplifies switchovers and
failovers by allowing you to invoke them using a single key click in Oracle Enterprise
Manager or a single command in the DGMGRL command-line interface (referred to in
this documentation as manual failover). Moreover, you can enable fast-start failover to
fail over automatically when the conditions for fast-start failover are met. When
fast-start failover is enabled, the broker determines if a failover is necessary and
initiates the failover to the specified target standby database automatically, with no
need for DBA intervention.
Fast-start failover allows you to increase availability with less need for manual
intervention, thereby reducing management costs. Manual failover gives you control
over exactly when a failover occurs and to which target standby database. Regardless
of the method you choose, the broker coordinates the role transition on all databases in
the configuration.
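The two role transitions described above can be sketched as functions over a small role map. This is a minimal illustration, not broker behaviour: the function names, the "failed" status for the old primary, and the unshipped_redo counter are all hypothetical, and data loss is reduced to the question of whether any redo had not reached the standby.

```python
def switchover(roles, target):
    """Planned role reversal: the primary becomes a standby, the target becomes primary."""
    if roles.get(target) != "standby":
        raise ValueError("switchover target must be a standby")
    old_primary = next(name for name, role in roles.items() if role == "primary")
    new_roles = dict(roles)
    new_roles[old_primary] = "standby"
    new_roles[target] = "primary"
    return new_roles


def failover(roles, target, unshipped_redo):
    """Unplanned transition: the target takes over from a failed primary.

    Returns (new_roles, data_loss). The old primary is marked "failed"
    (a hypothetical status) until it is reinstated.
    """
    if roles.get(target) != "standby":
        raise ValueError("failover target must be a standby")
    old_primary = next(name for name, role in roles.items() if role == "primary")
    new_roles = dict(roles)
    new_roles[old_primary] = "failed"
    new_roles[target] = "primary"
    # Data is lost only if some redo never reached the standby.
    return new_roles, unshipped_redo > 0
```

A switchover always returns a configuration with no data loss, while the failover result depends on how much redo was outstanding, which is what the protection mode governs.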
When the database is opened for the first time after a role transition, the
DB_ROLE_CHANGE system event fires. You can write a trigger that's associated with this system
event to manage tasks after a role change occurs. See the table of system manager
events in Oracle Database Advanced Application Developer's Guide for more details. After
a failover, the broker posts the DB_DOWN HA event in addition to firing the
DB_ROLE_CHANGE system event. You may use both the DB_DOWN HA event and the
DB_ROLE_CHANGE system event to, for example, help user applications locate services on the
new primary database.
Data Guard Management Interfaces
SQL*Plus Command Line
Data Guard Broker
DGMGRL Command Line
DMON process running on all databases in a Data Guard config
Simpler, single statement commands that perform the work of multiple SQL*Plus commands
Attach to any member of a Data Guard configuration and manage all members as a single configuration
Enterprise Manager Grid Control
No separate license purchase required
Snapshot Standby
Increase ROI
Preserves zero data loss – continuous redo transport while open read-write
Truly leverages standby database and DR hardware for multiple purposes
Similar to storage snapshots, but provides DR at the same time and uses a single copy of storage
A Data Guard 10.2 physical standby can be open read-write (for testing / cloning), but it can’t continue to receive redo from the primary
A physical standby thus opened is a regular read/write database with no connection to the primary
It can be converted back to be a physical standby
After conversion back to a physical standby, redo accumulated on the primary has to be sent across the network to re-sync
Data can be lost if a failure occurs before re-sync is complete
Data Guard 11g Snapshot Standby solves this problem
Applications, backups, reports run on production only
Traditional Physical Standby Databases
[Diagram labels: Real-time Queries, Standby Database, Production Database]
Simple, fast, supports all data types and applications
In Data Guard 10g
Can be open read-only, but Redo Apply has to stop
Latest data is not available for query or reports
Also prolongs switchover / failover
Oracle Active Data Guard 11g – a new Database Option
Real-time Query enables read-only access to a physical standby database while Redo Apply is active
Offload read-only queries to physical standby
Offload fast incremental backups to physical standby
Active Data Guard 11g
Increase ROI - Real-time Query
Supports RAC on primary and/or standby
Queries see transactionally consistent results
Handles all data types, but not as flexible as logical standby
In read-only standby, cannot write to TEMP tables in the current release
If auditing is required, must audit to O/S
For applications that need to write result tables, login information, and so on, create a stub database and link from it to the standby.
RMAN block change tracking on standby database
Fast incremental backups complete 20x faster
Active Data Guard Benefits
Simultaneous read & recovery
Complex replication used to create reporting replica
Performance protection
So now you can see how Oracle Database 11g with Active Data Guard provides new ways to enhance the Quality of Service for mission critical applications.
Only Active Data Guard can continue to apply primary database changes to an identical replica that can simultaneously be used for real-time query or reporting, using a process that is both simple to implement and high performing. This will make standby databases a more integral part of data centers, driven as much by the desire to enhance the quality of service of the production database as by concerns for data protection and availability.
Simplicity, high performance, and reliability are the key characteristics of Active Data Guard that differentiate it from traditional replication methods.
Microsoft High Availability
Failover clustering
Failover clustering provides high-availability support for an entire instance of SQL Server. A failover cluster is a combination of one or more nodes, or servers, with two or more shared disks. Applications such as SQL Server and Notification Services are each installed into a Microsoft Cluster Service (MSCS) cluster group, known as a resource group.
A failover cluster does not protect against disk failure. You can use failover clustering to reduce system downtime and provide higher application availability. Failover clustering is supported in SQL Server 2005 Enterprise Edition, Developer Edition and, with some restrictions, Standard Edition. For more information about failover clustering, see Failover Clustering and Installing a Failover Cluster.
Database mirroring
Database mirroring is primarily a software solution to increase database availability by supporting almost instantaneous failover. Database mirroring can be used to maintain a single standby database, or mirror database, for a corresponding production database that is referred to as the principal database.
Each database mirroring configuration involves a principal server that contains the principal database, and a mirror server that contains the mirror database. The mirror server continuously brings the mirror database up to date with the principal database.
Log shipping
Like database mirroring, log shipping operates at the database level. Log shipping can be used to maintain one or more warm standby databases, referred to as secondary databases, for a corresponding production database that is referred to as the primary database. Each secondary database is created by restoring a database backup of the primary database with no recovery, or with standby. Restoring with standby permits the resulting secondary database to be used for limited reporting purposes.
Before a failover can occur, a secondary database must be brought fully up-to-date by manually applying any unrestored log backups.
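The catch-up step before a log shipping failover can be sketched as applying the unrestored log backups in unbroken order. This is a simplified illustration: log backups are identified here by consecutive integer sequence numbers, which is an assumption made for the sketch, and catch_up is a hypothetical helper rather than a SQL Server command.

```python
def catch_up(restored_through, log_backups):
    """Apply unrestored log backups in order; return the last sequence applied.

    restored_through: sequence number of the last log backup already restored.
    log_backups: iterable of sequence numbers of the available log backups.
    Raises ValueError if a backup is missing, because the log chain must be
    applied without gaps.
    """
    pending = sorted(seq for seq in log_backups if seq > restored_through)
    for seq in pending:
        if seq != restored_through + 1:
            raise ValueError(f"missing log backup {restored_through + 1}")
        restored_through = seq
    return restored_through
```

Only when this step completes without a gap is the secondary database fully up to date and ready to take over.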
Log shipping provides the flexibility of supporting multiple standby databases. If you require multiple standby databases, you can use log shipping alone or as a supplement to database mirroring. When these solutions are used together, the current principal database of the database mirroring configuration is also the current primary database of the log shipping configuration.
Replication
Replication uses a publish-subscribe model, allowing a primary server, referred to as the Publisher, to distribute data to one or more secondary servers, or Subscribers. Replication allows real-time availability and scalability across these servers. It supports filtering to provide a subset of data at Subscribers, and also allows partitioned updates. Subscribers are online and available for reporting or other functions, without query recovery. SQL Server offers three types of replication: snapshot, transactional, and merge. Transactional replication provides the lowest latency and is most commonly used for high availability. For more information, see Improving Scalability and Availability.
Microsoft SQL Server Database Mirroring
Benefits
Operating modes
How Database Mirroring Works
The principal and mirror servers communicate and cooperate as partners in a database mirroring session. The two partners perform complementary roles in the session: the principal role and the mirror role. At any given time, one partner performs the principal role, and the other partner performs the mirror role. Each partner is described as owning its current role. The partner that owns the principal role is known as the principal server, and its copy of the database is the current principal database. The partner that owns the mirror role is known as the mirror server, and its copy of the database is the current mirror database. When database mirroring is deployed in a production environment, the principal database is the production database.
Database mirroring involves redoing every insert, update, and delete operation that occurs on the principal database onto the mirror database as quickly as possible. Redoing is accomplished by sending a stream of active transaction log records to the mirror server, which applies log records to the mirror database, in sequence, as quickly as possible. Unlike replication, which works at the logical level, database mirroring works at the level of the physical log record. Beginning in SQL Server 2008, the principal server compresses the stream of transaction log records before sending it to the mirror server. This log compression occurs in all mirroring sessions.
Operating Modes
A database mirroring session runs with either synchronous or asynchronous operation. Under asynchronous operation, the transactions commit without waiting for the mirror server to write the log to disk, which maximizes performance. Under synchronous operation, a transaction is committed on both partners, but at the cost of increased transaction latency.
There are two mirroring operating modes. One of them, high-safety mode, supports synchronous operation. Under high-safety mode, when a session starts, the mirror server synchronizes the mirror database together with the principal database as quickly as possible. As soon as the databases are synchronized, a transaction is committed on both partners, at the cost of increased transaction latency.
The second operating mode, high-performance mode, runs asynchronously. The mirror server tries to keep up with the log records sent by the principal server. The mirror database might lag somewhat behind the principal database. Typically, the gap between the databases is small, but it can become significant if the principal server is under a heavy work load or the system of the mirror server is overloaded.
SQL Server Database Mirroring
High-safety mode with automatic failover requires a third server instance, known as a witness. Unlike the two partners, the witness does not serve the database. The witness supports automatic failover by verifying whether the principal server is up and functioning. The mirror server initiates automatic failover only if the mirror and the witness remain connected to each other after both have been disconnected from the principal server.
The following illustration shows a configuration that includes a witness.
In high-performance mode, as soon as the principal server sends a log record to the mirror server, the principal server sends a confirmation to the client. It does not wait for an acknowledgement from the mirror server. This means that transactions commit without waiting for the mirror server to write the log to disk. Such asynchronous operation enables the principal server to run with minimum transaction latency, at the potential risk of some data loss.
All database mirroring sessions support only one principal server and one mirror server. This configuration is shown in the following illustration.
Full (Synchronous)
Off (Asynchronous)
Role Switching
Automatic failover
Manual failover
Forced service
Transaction Safety and Operating Modes
Whether an operating mode is asynchronous or synchronous depends on the transaction safety setting. If you exclusively use SQL Server Management Studio to configure database mirroring, transaction safety settings are configured automatically when you select the operation mode.
If you use Transact-SQL to configure database mirroring, you must understand how to set transaction safety. Transaction safety is controlled by the SAFETY property of the ALTER DATABASE statement. On a database that is being mirrored, SAFETY is either FULL or OFF.
If the SAFETY option is set to FULL, database mirroring operation is synchronous, after the initial synchronizing phase. If a witness is set in high-safety mode, the session supports automatic failover.
If the SAFETY option is set to OFF, database mirroring operation is asynchronous. The session runs in high-performance mode, and the WITNESS option should also be OFF.
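The relationship between the SAFETY setting, the witness, and the resulting operating mode can be summarized in a small lookup. This is an illustrative sketch only; operating_mode is a hypothetical helper, not part of SQL Server, and it simply encodes the two rules stated above.

```python
def operating_mode(safety, witness):
    """Map a SAFETY setting and witness presence to the session's behaviour.

    Returns (mode, synchronous, automatic_failover).
    """
    setting = safety.upper()
    if setting == "FULL":
        # Synchronous after the initial synchronizing phase; a witness
        # additionally enables automatic failover.
        return ("high-safety", True, bool(witness))
    if setting == "OFF":
        # Asynchronous; the witness is expected to be off in this mode.
        return ("high-performance", False, False)
    raise ValueError("SAFETY must be FULL or OFF")
```

For example, operating_mode("FULL", True) describes high-safety mode with automatic failover, while operating_mode("OFF", False) describes high-performance mode.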
Role Switching
Within the context of a database mirroring session, the principal and mirror roles are typically interchangeable in a process known as role switching. Role switching involves transferring the principal role to the mirror server. In role switching, the mirror server acts as the failover partner for the principal server. When a role switch occurs, the mirror server takes over the principal role and brings its copy of the database online as the new principal database. The former principal server, if available, assumes the mirror role, and its database becomes the new mirror database. Potentially, the roles can switch back and forth repeatedly.
The following three forms of role switching exist.
Automatic failover
This requires high-safety mode and the presence of the mirror server and a witness. The database must already be synchronized, and the witness must be connected to the mirror server.
The role of the witness is to verify whether a given partner server is up and functioning. If the mirror server loses its connection to the principal server but the witness is still connected to the principal server, the mirror server does not initiate a failover.
Manual failover
This requires high-safety mode. The partners must be connected to each other, and the database must already be synchronized.
Forced service (with possible data loss)
Under high-performance mode and high-safety mode without automatic failover, forcing service is possible if the principal server has failed and the mirror server is available.
Important: High-performance mode is intended to run without a witness. But if a witness exists, forcing service requires that the witness is connected to the mirror server.
In any role-switching scenario, as soon as the new principal database comes online, the client applications can recover quickly by reconnecting to the database.
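The conditions for the three forms of role switching can be collected into one decision function. This is a sketch under stated assumptions: available_role_switches and its boolean parameters are hypothetical names, the check is made from the mirror server's point of view, and "principal has failed" is approximated as the partners no longer being connected.

```python
def available_role_switches(safety_full, witness_configured, synchronized,
                            partners_connected, witness_sees_mirror,
                            witness_sees_principal):
    """Return the set of role-switch forms that the rules above would permit."""
    forms = set()
    principal_lost = not partners_connected

    # Automatic failover: high-safety mode with a witness, a synchronized
    # database, and mirror and witness still connected to each other after
    # both have lost the principal.
    if (safety_full and witness_configured and synchronized
            and principal_lost and witness_sees_mirror
            and not witness_sees_principal):
        forms.add("automatic failover")

    # Manual failover: high-safety mode, partners connected, synchronized.
    if safety_full and synchronized and partners_connected:
        forms.add("manual failover")

    # Forced service: high-performance mode, or high-safety mode without
    # automatic failover, when the principal is lost. If a witness exists,
    # it must be connected to the mirror.
    auto_failover_mode = safety_full and witness_configured
    if (principal_lost and not auto_failover_mode
            and (not witness_configured or witness_sees_mirror)):
        forms.add("forced service")

    return forms
```

Note the case where the mirror loses the principal but the witness can still see it: the function returns an empty set, matching the rule that the mirror does not initiate a failover in that situation.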
Disaster recovery solution for IBM’s DB2
Automates the creation and maintenance of a synchronized copy of the primary database
If the primary database becomes unavailable, a standby database can easily assume the primary role
Client automatic failover
What is HADR?
DB2 High Availability Disaster Recovery (HADR) is a database replication
feature that provides a high availability solution for both partial and complete site
failures. A database server can fail from any number of factors: environmental
(power/temperature), hardware, network connectivity, software, or human
intervention. Recovery from the loss of a database server in a standard installation can
require a machine reboot and database crash recovery, which interrupt the
database services. By using DB2 HADR, you can safely minimize the downtime
to only a few seconds. HADR protects against data loss by continually replicating
data changes from a source database, called the primary, to a target database,
called the standby. Furthermore, you can seamlessly redirect clients that were
using the original primary database to the standby database (which becomes the
new primary database) by using Automatic Client Reroute (ACR) and retry logic
in the application. This seamless redirection is also possible using software
solutions discussed in other chapters, which can manage IP address redirection.
IBM DB2 HADR
Overview
HADR transmits the log records from the primary database server to the standby
server. The HADR standby replays all the log records to its copy of the database,
keeping it synchronized with the primary database server. The standby server is
in a continuous rollforward mode and is always in a state of near-readiness, so
the takeover to the standby is extremely fast. Applications can only access the
primary database and have no access to the standby database.
Using two dedicated TCP/IP communication ports and a heartbeat, the primary
and the standby keep track of where they are processing currently, and the
current state of replication; perhaps most importantly, whether the standby
database is up to date with the primary (known as HADR Peer state). When a
log buffer is being written to disk on the primary, it is sent to HADR to be routed
to the standby at the same time.
As mentioned, HADR communication between the primary and the standby is
through TCP/IP, which enables the database to be replicated to a remote
geographical site. This allows the database to be recovered either locally or at
the remote site in case of a disaster at the primary database server site. The
HADR solution thus provides both High Availability and Disaster Recoverability.
As you can appreciate, HADR provides a substantial improvement over
conventional methods for DB2 disaster recovery, which otherwise could mean
losing hours of committed transaction data.
If the primary system is not available, a takeover HADR by force operation
converts the standby system to be the new primary system. After the
maintenance or repair is done in the old primary system, you can bring the
original primary DB2 up and return both DB2 servers to their primary and
HADR Topology
With HADR, you can choose the level of protection you want from potential loss
of data by specifying one of the three synchronization modes:
Synchronous:
Log write is considered successful only when the log buffers have been
written to the primary’s log files and after an acknowledgement has been
received that it has also been written to the standby’s log files. There can be
no data loss in this mode if HADR is in Peer state.
Near-synchronous:
This is the default option. Log write is successful only when the primary’s log
buffer has been written to log files on the primary and an acknowledgement is
received that the log buffer has been received on the standby.
Asynchronous:
Log write is successful when logs have been written to the disk on the primary
and log data has been sent through TCP/IP to the standby. Data loss can
occur in this mode.
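The three definitions of a successful log write can be compared side by side in a short predicate. This is an illustrative sketch: log_write_successful and its boolean arguments are hypothetical, and the mode labels SYNC, NEARSYNC, and ASYNC are shorthand chosen here for the three modes listed above.

```python
def log_write_successful(mode, written_on_primary, sent_to_standby,
                         received_on_standby, written_on_standby):
    """Decide whether a log write counts as successful in the given mode."""
    if not written_on_primary:
        # Every mode requires the log to be on disk at the primary first.
        return False
    if mode == "SYNC":
        # Standby must acknowledge that the log reached its log files.
        return written_on_standby
    if mode == "NEARSYNC":
        # Standby must acknowledge that it received the log buffer.
        return received_on_standby
    if mode == "ASYNC":
        # Handing the log data to TCP/IP is enough.
        return sent_to_standby
    raise ValueError(f"unknown synchronization mode: {mode}")
```

The same event sequence can therefore succeed in one mode and still be pending in a stricter one, which is where the difference in possible data loss comes from.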
HADR Topology
Synchronization modes
SnapMirror
Simple, flexible and cost-effective
Operate with FC or IP network
Mirror to/from any NetApp system
Multi-hop, cascading
Backup data can be made writeable
LAN
On the top is a production environment: servers connected via a LAN to FAS devices. Below is the disaster recovery site, a duplicate of the original, with SnapMirror technology in between. (click) The value of this SnapMirror architecture is that it’s simple, flexible and cost-effective. (click)
First, it’s simple. It’s very easy to configure. It takes minutes to set up instead of the days or weeks needed for alternative mirroring solutions. Furthermore, it can run across a standard IP network, which simplifies network configuration. (click)
Secondly, it’s flexible and can run on all NetApp platforms. Because NetApp devices can replicate to any other NetApp device, customers can, for example, replicate over Fibre Channel or IP without protocol converters. This provides the flexibility to choose the frequency of replication. They can also replicate multiple systems into a single system, a single system into multiple systems, or cascade replications where one NetApp device replicates to another which in turn replicates to a third. (click)
Finally, SnapMirror is very cost-effective. One license allows customers to implement SnapMirror in any mode, which means they can configure anything from asynchronous to synchronous replication. SnapMirror doesn’t replicate the entire data set but only the block-level incremental changes. This minimizes both the required network bandwidth and the cost. NetApp provides read-only access to replicated data so remote users can access it locally. The data on the target side can be leveraged for other uses such as application testing or data mining. Being able to replicate between any NetApp devices also provides the flexibility to deploy low-cost systems as disaster recovery targets. This reduces, even further, the investment needed to deploy disaster recovery.
So let’s walk through some of the key features and benefits.
SnapMirror Flexibility
Synchronous SnapMirror
Here are the three: synchronous, semi-synchronous and asynchronous. (click) With synchronous mirroring, data at the disaster recovery site exactly matches data at the primary site. This is achieved by replicating every data write to the remote location and not acknowledging the write to the host until it’s confirmed by the remote system. (click)
This solution provides the least data loss. However, because the host application must wait for an acknowledgement from the remote NetApp devices, it’s limited to a distance of 50-100 km before latency becomes too great. (click)
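The distance limit can be made concrete with a rough propagation-delay estimate. This is a back-of-the-envelope sketch, not a NetApp figure: it assumes light travels through optical fiber at roughly 200,000 km per second, and it ignores switching, protocol, and storage processing time, which add to the real latency.

```python
# Assumed signal speed in optical fiber, roughly two thirds of the speed of light.
FIBER_KM_PER_SECOND = 200_000


def round_trip_ms(distance_km):
    """Minimum added latency per synchronous write, in milliseconds."""
    return 2 * distance_km * 1000 / FIBER_KM_PER_SECOND


for km in (50, 100, 1000):
    print(f"{km} km: at least {round_trip_ms(km)} ms per acknowledged write")
```

Under that assumption every synchronous write pays at least 1 ms at 100 km and 10 ms at 1000 km before any other overhead, which is why synchronous replication is kept to metro distances.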
Semi-synchronous SnapMirror allows you to achieve a near-zero data loss disaster recovery solution without the performance impact on the host application. (click) This means you can do synchronous-type replication over longer distances. When data is written to the primary storage, an acknowledgement is immediately sent back, eliminating the latency impact on the host. In the background, SnapMirror tries to maintain as close to synchronous communication as possible with the remote system and leverages user-defined thresholds on the discrepancy between the source and remote copy data sets. (click)
Finally, asynchronous SnapMirror replicates data at adjustable frequencies. (click) The benefit here is you can do point-in-time replication as frequently as every minute, or as infrequently as every few days. There’s no distance limitation, so it’s often used to replicate over long distances to protect against regional disasters. And since only the blocks that have changed between each replication are sent, it minimizes network usage.
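The idea of sending only the blocks that changed between two replications can be sketched with dictionaries standing in for point-in-time copies. This is a conceptual illustration, not how SnapMirror is implemented: changed_blocks and apply_update are hypothetical helpers, a copy is modelled as a mapping from block number to contents, and freed blocks are not handled.

```python
def changed_blocks(previous, current):
    """Return {block_number: data} for blocks that are new or differ
    from the previous point-in-time copy."""
    return {number: data for number, data in current.items()
            if previous.get(number) != data}


def apply_update(mirror, delta):
    """Return the mirror contents after applying an incremental update."""
    updated = dict(mirror)
    updated.update(delta)
    return updated


previous = {0: b"alpha", 1: b"beta", 2: b"gamma"}
current = {0: b"alpha", 1: b"BETA", 2: b"gamma", 3: b"delta"}

delta = changed_blocks(previous, current)
print(f"blocks sent: {sorted(delta)} of {len(current)}")
```

Only two of the four blocks cross the network in this example, and applying the delta to the previous copy reproduces the current one.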
There are also a number of SnapMirror deployment options.
SnapMirror® Deployment Options
With SnapMirror we can configure multi-hop configurations. The common configuration consists of synchronous replication between two devices within a metro area, adding physical distance between servers for better, broader disaster recovery protection. (click)
We can also add a longer hop with an asynchronous connection cascading out of the target. We can do many-to-one replications or one-to-many replications. (click)
With a NetApp V-series device, we can actually replicate a competitor’s storage array to a NetApp device. This extends the benefits of SnapMirror to heterogeneous environments.
In addition to general replication, customers have the ability to leverage the data for other uses.
NetApp Solution:
Primary Data Center
Replication
The animations are the same, but there are a few differences to note. Here on the left you see a Windows application such as Exchange or SQL Server. (click)
In this solution, SnapMirror integrates with the Windows application through SnapManager and SnapDrive. (click)
Once SnapManager creates a snapshot in the Windows application, it begins to replicate the data to a remote site. The replication frequency established within SnapManager for Snapshots is the same used by SnapMirror for replication frequency. (click)
In this environment you can actually set up different mirroring relationships, so the database can be replicated at a chosen frequency, such as once or twice a day, depending on the bandwidth you have available. Logs can be replicated much more often, about every fifteen minutes, and sometimes even synchronously.
Application Mirroring Vs. SnapMirror