1
Oracle Data Guard: Defining the Next Era in Data Availability and Data Protection
Ashish Ray, Group Product Manager, Oracle
Lee Parsons, Database Engineering Manager
3
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remain at the sole discretion of Oracle.
4
Agenda
• Disaster Recovery (DR) – Common Concerns
• Data Guard – A Quick Introduction
• Data Guard – Extending Beyond DR
• Amazon.com – Beyond Custom Physical Standbys
5
Common Concerns for DR Solutions
• Roadblocks for adoption of DR solutions
  • Perception around the term “Disaster”
    • “Disaster” often linked to destructive events that occur infrequently, so no strong urge to implement a DR solution
      “When it happens, we will see.”
      “We do tape backups, and that should be fine, right?”
  • Shortcomings of existing solutions
    • Most DR solutions involve redundant systems that can’t be utilized for productive use
    • The solutions are expensive, with no immediate ROI (till “disaster” occurs)
      “We don’t have budget for machines basically sitting idle.”
6
What is a “Disaster”?
• Well-recognized disasters such as headline-grabbing events
  • Fire, earthquake, tsunami, flood, hurricane, …
• What about more mundane events that still cause outages but occur much more frequently?
  • Faulty system components – server, network, storage, software, …
  • Data corruptions
  • Backup/recovery of bad data
  • Wrong batch job
  • Bad HW/SW installations / upgrades / patching
  • Operator errors
  • Power outages
  • Etc.
7
Real-life “Disaster”: Financial Services Company
• Examples of the errors observed in the alert.log of the production database:
  • Errors in file /opt/app/oracle/admin/dg/bdump/dg1.trc:
  • ORA-01186: file 93 failed verification tests
  • ORA-01122: database file 93 failed verification check
  • ORA-01110: data file 93: '/dbmnt/db01/oradata/dg/arch05.dg'
  • ORA-01251: Unknown File Header Version read for file number 93
• ORA-01251 indicates a corrupted file header. Possible causes: a missed read or write, a hardware problem, or a process external to Oracle overwriting the information in the file header.
• Affected database: one of the firm’s most critical databases, supporting its retail businesses
• Supports the firm’s primary customer-facing applications for trade transaction confirmation, new accounts, and customer account information
Traditional DR solutions such as storage mirroring would propagate this corruption to target storage volumes, rendering them useless as well
8
Needed: Next-Generation DR Solution
Comprehensive Availability & Protection
• Data Availability
  • Outages should be tolerated transparently
  • Outages should be recovered from quickly
• Data Protection
  • Standby data should be isolated from production faults
  • No data should be lost
• Systems Utilization
  • Standby resources should be utilized for productive use
• Fully integrated in a cost-effective manner
That’s where Data Guard comes in!
9
Remember – Real-life “Disaster”? What Did They Do?
• Thankfully, they already had Data Guard implemented
  • Physical Standby, Maximum Availability
  • Data Guard architecture prevented corruption from affecting their standby databases
• Failed over to the standby database
  • New production database up in minutes, no loss of data
• Independently investigated problems at original production server
  • Problem traced to faulty storage array component
  • Took a few days to investigate and fix the problem
10
What is Data Guard?
• Data Availability & Data Protection solution for Oracle
• Automates the creation and maintenance of one or more synchronized copies (standby) of the production (or primary) database
• If the primary database becomes unavailable, a standby database can easily assume the primary role
• Feature of Oracle Database Enterprise Edition (EE)
  • Feature available at no extra cost
  • Primary and standby databases need to be licensed EE
11
Oracle’s Integrated HA Solution Set
[Figure: Oracle’s HA solution set. Unplanned downtime (system failures, data failures) and planned downtime (system changes, data changes) are addressed by Real Application Clusters, ASM, Flashback, RMAN & Oracle Secure Backup, H.A.R.D, Data Guard, Streams, Online Reconfiguration, Rolling Upgrades, and Online Redefinition, alongside Oracle MAA Best Practices.]
12
Data Guard Configuration
• Managed as a single configuration
• Primary and standby databases can be Real Application Clusters or single-instance Oracle
• Up to nine standby databases supported in a single configuration
[Figure: a primary database at the Primary Site and standby databases at Standby Site A and Standby Site B, with the Broker]
13
Data Guard – DR and Beyond
1. High data availability
2. Comprehensive data protection
3. Efficient systems utilization
Availability Protection Utilization
Data Guard Utility Meter
14
High Data Availability
High Data Availability requirements:
Maintain high availability through
• Server Failures
• Network Failures
Needed: a failover mechanism that is fast, automatic, and doesn’t lose data
[Figure: role transition following a disaster / outage. Primary Site A and Standby Site B become Standby Site A and Primary Site B.]
15
Comprehensive Data Protection
Comprehensive Data Protection requirements:
Provide bullet-proof protection from
• Storage Failures
• Site Failures
• Data Corruptions
• Operator Errors
[Figure: Primary Site A and Standby Site B]
16
Efficient Systems Utilization
Efficient Systems Utilization requirements:
Standby resources should be used productively by
• Administrators, for planned maintenance operations
• End-users, for application access
[Figure: Primary Site A and Standby Site B, with end-users and administrators]
17
Data Guard – DR and Beyond
1. High Data Availability, from:
• Server Failures
• Network Failures
[Utility meter: Availability (server failures, network failures)]
18
Data Guard – Combined High Availability and Disaster Recovery
• Provided through the Fast-Start Failover feature
  • Data Guard automatically fails over to designated standby database
  • Standby can become a primary in a few seconds
  • Application clients may also be automatically reconnected to the new primary database
  • No manual intervention, no data loss
  • Protection from disasters / outages
19
Fast-Start Failover
[Figure: Primary Site, Standby Site, and Observer]
1. Data Guard in steady state – transmitting redo
2. Observer monitoring state of the configuration
20
Fast-Start Failover
[Figure: Primary Site, Standby Site, and Observer]
3. Disaster strikes the primary – connections lost
21
Fast-Start Failover
[Figure: Primary Site, Standby Site, and Observer]
4. Observer <=> primary connection times out (timeout threshold configurable)
5. Observer asks target standby if it is ready to fail over
6. Observer begins Fast-Start Failover
22
Fast-Start Failover
[Figure: Observer and Primary Site]
7. Target standby automatically becomes new primary
23
Fast-Start Failover
[Figure: Observer, Standby Site, and Primary Site]
8. After old primary is repaired, Observer re-establishes connection
9. Observer automatically reinstates old primary to be a new standby
10. Redo transmission starts from new primary to new standby
24
Fast-Start Failover – How Fast?
[Figure 2: Fast-Start Failover Test Results. Bar chart of average failover time in seconds (axis 0 to 25) for physical standby and logical standby, single instance and RAC.]
From MAA paper: Fast-Start Failover Best Practices, http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
25
Fast-Start Failover – Operational Tips
• Requires:
  • Data Guard Broker to be enabled
  • Maximum Availability protection mode
  • Flashback Database to be enabled for auto-reinstate
• Occurs during:
  • Network / cluster failures
  • Shutdown abort / datafiles offline
• Best Practices
  • Place Observer in the same network segment as the middle tiers
  • Monitoring – FS_FAILOVER_STATUS in v$database
  • Set DB_FLASHBACK_RETENTION_TARGET to a minimum of 60 mins
  • With Grid Control Agent installed, Observer can be automatically restarted if the Observer process were to ever stop
  • Possible to configure multiple Observers on the same server, each monitoring its own Data Guard configuration
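The monitoring and retention tips above can be sketched in SQL*Plus roughly as follows. This is an illustrative sketch rather than part of the original slides: it assumes Oracle Database 10g Release 2, a SYSDBA connection, and an spfile (for SCOPE = BOTH). The value 60 simply mirrors the minimum suggested on the slide.

```sql
-- Check Fast-Start Failover state as seen by this database.
-- FS_FAILOVER_STATUS is the column named on the slide; the other
-- FS_FAILOVER_* columns are companion columns of v$database in 10gR2.
SELECT fs_failover_status,
       fs_failover_current_target,
       fs_failover_observer_present
  FROM v$database;

-- Confirm that Flashback Database is enabled (needed for auto-reinstate).
SELECT flashback_on FROM v$database;

-- Keep at least 60 minutes of flashback data (the unit is minutes).
ALTER SYSTEM SET db_flashback_retention_target = 60 SCOPE = BOTH;
```

Running the first query on both primary and target standby is a simple way to see whether the two agree on the failover state.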
26
Data Guard – DR and Beyond
2. Comprehensive Data Protection, from:
• Storage failures
• Site failures
• Data corruptions
• Operator errors
[Utility meter: Availability (server failures, network failures); Protection (storage failures, site failures, data corruptions, operator errors)]
27
Data Guard: Basic DR (Of Course …)
[Figure: four numbered panels. 1 and 2: Primary Site and Standby Site. 3: Old Primary Site and New Primary Site. 4: Standby Site and Primary Site.]
28
Data Corruption Protection by Data Guard
• Faulty system component could physically corrupt data files / redo log files / control file, affecting primary database operations
  • Any component can fail: file system, volume manager, device driver, host bus adapter, storage controller, disk drive
  • Remember the earlier real-life disaster example?
• Data Guard protection
  • Robust checks and balances in place to ensure physical data corruptions on primary database do not affect standby database
• Some real-life examples follow …
29
Data Protection in Action – Example 1
• Example: Standby Redo Log corrupted, however Archiver detects it on standby database, corruption does not spread
*** 2005-04-04 20:33:24.670
Archiving standby database
Selected standby logfile…
Corrupt redo block 17457 detected: bad checksum
Seq: 0x00000bfb Block: 0x00004431 Time: 554759965 Beg: 0x10 Cks: 0x8deb
*** 2005-04-04 20:33:25.467
ARC0: All Archive destinations made inactive due to error 354
*** 2005-04-04 20:33:25.467
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 17457 change 0 time 04/04/2005 19:59:25
ORA-00312: online log 11 thread 1: 'D:\REDO01.DBF'
*** 2005-04-04 20:33:25.498
ARC0: Archiving not possible: error count exceeded
ORA-16038: log 11 sequence# 0 cannot be archived
30
Data Protection in Action – Example 2
• Example: archivelog corrupted, and Redo Apply on standby database detects it and stops applying
Tue Jun 22 18:25:29 2004
Errors in file ora_3506.trc:
ORA-01115: IO error reading block from file 480 (block # 67249)
ORA-01110: data file 480: 'replication_01.dbf'
ORA-27091: skgfqio: unable to queue I/O
ORA-27072: skgfdisp: I/O error
Linux Error: 14: Bad address
Additional information: 67248
ORA-00368: checksum error in redo log block
ORA-00353: log corruption near block 281856 change 5682682353914 time 06/22/2004 16:18:43
ORA-00334: archived log: 'redolog_01.arc'
…
Media Recovery failed with error 368
31
How Does Data Guard Ensure Data Protection?
• Data Guard: a loosely coupled architecture
  • Standby databases kept synchronized through redo blocks, completely detached from possible datafile corruptions on primary
  • In some redo transport configurations, redo is shipped from primary SGA, and thus detached from physical I/O corruptions on primary
  • Software code-path executed on standby fundamentally different from that of primary – effective seclusion from software errors
• Corruption-detection checks at key interfaces
  • Primary: during Redo Transport: LGWR, LNS, ARCH
  • Standby: during Redo Apply: RFS, ARCH, MRP, LSP, DBWR
• If redo corruption detected on standby, Data Guard tries to re-fetch valid logs as part of archivelog gap handling
• Fundamental Principle: Primary database corruption should not affect the standby database
32
Data Corruption Protection: Operational Tips
• Two key parameters:
  • db_block_checksum (OFF | TYPICAL | FULL)
    • Determines whether checksum computed, stored and verified for data & redo blocks
  • db_block_checking (OFF | LOW | MEDIUM | FULL)
    • Semantic block checking for data blocks
• Recommended settings:

  Parameter            Primary    Standby
  db_block_checksum    TYPICAL    TYPICAL
  db_block_checking    MEDIUM     MEDIUM*

* Check for possible impact
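As a sketch of how the recommended settings might be applied, the statements below set both parameters dynamically. This is an illustration, not taken from the slides: it assumes a SYSDBA session and an spfile (for SCOPE = BOTH), and that the MEDIUM setting has been tested for its overhead first, as the footnote advises.

```sql
-- Run on the primary and on each standby.
ALTER SYSTEM SET db_block_checksum = TYPICAL SCOPE = BOTH;
ALTER SYSTEM SET db_block_checking = MEDIUM  SCOPE = BOTH;

-- Verify the values now in effect.
SELECT name, value
  FROM v$parameter
 WHERE name IN ('db_block_checksum', 'db_block_checking');
```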
33
Utilizing Data Guard upon Data Corruption in the Primary Database
• If data blocks critical to application functionality are corrupted
  • Perform Data Guard switchover / failover to standby
    • Resumes application availability with a new valid primary database
    • Corruption issues on new standby can be investigated offline
    • Provides fastest predictable recovery time objective (RTO)
    • Fast-Start Failover: no data loss, failover can be done in seconds
• If non-critical data blocks are corrupted
  • Perform RMAN Block Media Recovery (BLOCKRECOVER …) using a valid datafile backup from the physical standby database
    • Used when a small number of blocks require media recovery and the blocks that need recovery are known
    • Affected datafiles will be online (except corrupt blocks)
  • Perform RMAN restore and recovery using valid datafiles from the physical standby database
    • When entire datafiles have been corrupted
34
Utilizing Standby Database To Recover From Logical Corruptions
• Logical corruptions may result from running bad scripts, inadvertent deletes, incorrect updates, etc.
• Database may be operational without any database errors, but application behavior may be impaired
• Solution: use standby database to recover with minimal production downtime
  • Use Flashback Database on standby database to revert to a known good state, open up standby, import data back into primary database
• Apply process on standby database may also be run in a delayed mode such that standby may not be affected at all by these logical corruptions
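The Flashback-based recovery described above might look roughly like this on a physical standby. It is a hedged sketch, not the presenters' procedure: it assumes Flashback Database is already enabled on the standby, that the standby is mounted, and that the timestamp shown (a made-up value) is a point known to precede the logical corruption.

```sql
-- On the physical standby: stop Redo Apply before rewinding.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

-- Rewind the standby to a known good point in time.
-- The timestamp is a hypothetical example value.
FLASHBACK DATABASE TO TIMESTAMP
  TO_TIMESTAMP('2006-10-24 09:30:00', 'YYYY-MM-DD HH24:MI:SS');

-- Open read-only so the good data can be extracted (for example,
-- exported and then imported back into the primary).
ALTER DATABASE OPEN READ ONLY;
```

Once the data has been extracted, Redo Apply can be restarted on the standby so that it catches up with the primary again.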
35
Operational Errors: Another Real-life Disaster
• Large telecom company
  • Multi-TB customer service production database serving millions of customers, set up with online redo logs that were not multiplexed
  • Had at least one large table (over 1 billion rows)
  • The only available online redo log was corrupted
  • Database instance was shut down
  • When an attempt was made to open it:
    ksedmp: internal or fatal error
    ORA-00600: internal error code, arguments: [2662], [1965], [349730312], [1965], [349743443], [666894377], [], []
    2662: a data block SCN ahead of the current SCN, possibly due to some physical corruption
• The production database had no standby database
36
Operational Errors … contd.
• Restoring an old backup and recovering was estimated to take 1 day+
• Managed to open the database in degraded mode (1/3rd of apps)
• Several other corruptions noticed
• Used various means to try identifying corrupt blocks
  • dbms_repair.check_object
  • dbms_space_admin.tablespace_verify
  • DBVERIFY
  • (RMAN) BACKUP VALIDATE
  • ANALYZE TABLE … VALIDATE STRUCTURE CASCADE ONLINE
• Final decision: rebuild the database
  • Corruption spread was extensive
  • Performance problems running corruption checks on production
• After several days, managed to build a new database using a combination of transportable tablespace and log mining
37
Operational Error – Lessons Learned!
• If they had a Data Guard standby database
  • Could have simply switched over / failed over to the standby
  • Application would have been online in a few minutes
  • Corruption diagnosis and repair could have been done offline
• Having your mechanic do an engine repair on your car while you are driving on the freeway is a BAD idea
  • Most car service dealerships offer you a rental car for a reason!
38
Data Guard – DR and Beyond
3. Efficient Systems Utilization
• Administrator utilization
  • Rolling database upgrades
  • Migrate data centers, SANs, platforms
  • Use physical standbys for backups
  • Cloning / testing of production workload
• End-user utilization
  • Use logical standbys for apps, reporting, read-access
[Utility meter: Availability (server failures, network failures); Protection (storage failures, site failures, data corruptions, operator errors); Utilization (planned maintenance, application access)]
39
SQL Apply – Rolling Database Upgrades
Major release upgrades, patch set upgrades, cluster software & hardware upgrades
[Figure: four-step rolling upgrade between databases A and B.
1. Initial SQL Apply config: clients on A, redo shipped to B, both at version X
2. Upgrade node B to X+1: A stays at X, logs queue
3. Run in mixed mode to test: A at X, B at X+1, redo shipping
4. Switchover to B, upgrade A: both at X+1, redo shipping]
40
Other Planned Maintenance Activities Possible with Data Guard
• Data Guard: very effective way to migrate data centers / SANs
  • Create standby databases without any production impact
  • Keep them synchronized automatically
    • Possible to use incremental backups – see MetaLink Note 290814.1
    • For expediting standby sync-up, refer to max_connections attribute
• Before the cut-off time, perform Data Guard switchover
• Switchover: another effective way to do configuration changes with minimal downtime
• Hardware changes on primary server: Data Guard is agnostic of underlying server / storage system
• Migration to RAC / ASM
• Selected platform migration – between:
  • HP-UX PA-RISC and HP-UX Itanium (see MetaLink Note 395982.1)
  • Linux distributions (e.g. RedHat and Suse) (on the same platform)
  • AMD-64 and Intel-64 (same OS)
41
Offload Backups to Physical Standby
• Oracle Recovery Manager (RMAN) integrated with Data Guard – backups can be offloaded to physical standby
  • On production server: saves processing cycles, no impact
  • Backups can be done while physical standby in recovery / open read-only
  • Backups can be used by primary / other standbys
  • Standbys can be created from RMAN backups with no primary downtime
• Operational best practices for backups on standby
  • Flash Recovery Area configuration (see MetaLink Note 331924.1):
    Primary: CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON STANDBY;
    Standby: CONFIGURE ARCHIVELOG DELETION POLICY TO NONE;
  • Use identical directory structures
  • Use RMAN recovery catalog
  • Use spfile for primary and standby databases
Refer to “Using Recovery Manager with Oracle Data Guard in Oracle Database 10g” http://www.oracle.com/technology/deploy/availability/techlisting.html
42
Physical Standby for Cloning / Testing
• A Physical Standby can be opened read/write for development, reporting, or testing purposes, and then flashed back to be a physical standby once again
  • When flashed back, Data Guard automatically synchronizes the standby with the primary
43
Cloning / Testing … contd.
• Operational Steps
  • Set up Flash Recovery Area on the physical standby
  • Cancel Redo Apply
    create restore point pre_clone guarantee flashback database;
  • Defer primary database connection to this standby
    alter database activate standby database;
  • Open the “physical standby”
  • Perform testing / reporting
    flashback database to restore point pre_clone;
    alter database convert to physical standby;
  • Resume Redo Apply
  • Reconnect primary to standby
• Excellent way to do testing on a production workload without impacting the production database
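The operational steps above can be strung together as a SQL*Plus session along these lines. This is a sketch under stated assumptions, not the presenters' script: it assumes Oracle Database 10g Release 2, a Flash Recovery Area already configured on the standby, and that the standby is served by archive destination 2 on the primary (the destination number is hypothetical). The restore point name pre_clone is the one used on the slide.

```sql
-- On the physical standby: stop Redo Apply and mark the return point.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
CREATE RESTORE POINT pre_clone GUARANTEE FLASHBACK DATABASE;

-- On the primary: archive the current log, then stop shipping redo
-- to this standby (destination 2 is an assumed example).
ALTER SYSTEM ARCHIVE LOG CURRENT;
ALTER SYSTEM SET log_archive_dest_state_2 = DEFER;

-- On the standby: activate it and open it read/write for testing.
ALTER DATABASE ACTIVATE STANDBY DATABASE;
ALTER DATABASE OPEN;

-- ... perform testing / reporting here ...

-- On the standby: discard the test changes and become a standby again.
STARTUP MOUNT FORCE
FLASHBACK DATABASE TO RESTORE POINT pre_clone;
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
STARTUP MOUNT FORCE
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

-- On the primary: resume redo shipping to the standby.
ALTER SYSTEM SET log_archive_dest_state_2 = ENABLE;
```

While the destination is deferred the standby receives no redo, so the primary has no DR protection from this standby for the duration of the test.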
44
Scale-out with Logical Standby for Web-App
Primary Database
Logical standbys: scaling out read access (web content browsing)
Physical standby, sync transport, for DR
Server Farm approach
45
Offload Applications to Logical Standby
• Use Logical standby to offload applications from production database and save processing cycles
  • Applications well-suited: those that do a lot of processing with read-only data, and produce interim data-sets that are not mission-critical enough to be disaster-protected
  • Examples:
    • A billing application that does not cause key database updates
    • An application that schedules site visits for tech support personnel
    • An ETL tool that feeds data into a Data Warehousing application
• Operational best practices
  • Change GUARD setting: alter database guard standby;
  • These apps may even be restricted to access only the logical standby
  • If logical standby is RAC, the non-Apply nodes should be utilized for these apps
  • Watch out for unsupported data types
46
Offload Reporting to Logical Standby
• Similar to offloading application processing
• Reporting apps may have certain special requirements
  • Reporting may have to be on the latest data (real-time reports)
    • Use LGWR SYNC for redo transport
    • Use Real-Time apply
      alter database start logical standby apply immediate;
  • Reporting may require local write-access to summary tables
    • If these tables also exist on primary, they need to be skipped
      • Stop apply
      • Skip apply on these tables (use dbms_logstdby.skip)
      • Start apply
      • Change GUARD setting: alter database guard standby;
  • Reporting apps may require additional indexes / materialized views
    • Stop apply
    • Disable guard for the session (alter session disable guard;)
    • Create index
    • Enable guard for the session (alter session enable guard;)
    • Start apply
• For RAC standby, the non-Apply nodes should be utilized for reporting
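The two procedures above (skipping a locally writable summary table, and adding a reporting index) might be sketched as follows on the logical standby. The schema, table, column, and index names are hypothetical placeholders introduced only for illustration; the statements assume Oracle Database 10g Release 2 and a suitably privileged session.

```sql
-- 1. Let reports write to a local summary table that also exists on primary.
ALTER DATABASE STOP LOGICAL STANDBY APPLY;
-- REPORTS.SALES_SUMMARY is a hypothetical table name.
EXECUTE DBMS_LOGSTDBY.SKIP(stmt => 'DML', schema_name => 'REPORTS', object_name => 'SALES_SUMMARY');
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
-- Protect only the objects maintained by SQL Apply.
ALTER DATABASE GUARD STANDBY;

-- 2. Add an index that exists only on the standby, for reporting queries.
ALTER DATABASE STOP LOGICAL STANDBY APPLY;
ALTER SESSION DISABLE GUARD;
-- APP.ORDERS and ORDERS_RPT_IDX are hypothetical names.
CREATE INDEX app.orders_rpt_idx ON app.orders (order_date);
ALTER SESSION ENABLE GUARD;
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
```

Skipping only DML leaves DDL on the table still replicated; whether DDL should also be skipped depends on how the summary table is maintained.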
47
Data Guard – Defining the Next Era in Data Availability & Data Protection
1. High data availability
  • Integrated high availability through Fast-Start Failover
2. Comprehensive data protection
  • Protection from data corruptions
  • Protection from operational errors
3. Efficient systems utilization
  • Rolling database upgrades
  • Migrate data centers, SANs, platforms, etc.
  • Use physical standbys for backups
  • Cloning / testing of production workload
  • Use logical standbys for apps, reporting, read-access
Availability Protection Utilization
48
Amazon.com – Beyond Custom Physical Standbys
49
Amazon.com – Life before Data Guard
• Major HA & DR requirements:
  • Guaranteed no data loss
  • Failovers should be as quick as possible
  • No impact on primary database’s availability due to standby failures
• Amazon built scripts and programs to automatically maintain physical standby databases
  • Automatically copy, apply and archive redo
  • Monitor, respond to, and raise alarms on problems
  • Failover/Switchover is a manual process
50
Amazon.com – Life before Data Guard
• Issues with building a custom standby solution
  • Complexity of code to handle all failure cases
  • Solutions tied to site specific configuration of Oracle
  • Compatibility with newer releases of Oracle and the Operating System
  • Unique solutions require unique training for new DBAs
  • Longer than desired switchover
  • Primary is unaware of the standby and doesn’t care if it is current
  • Does not scale to a large number of databases
51
Amazon.com – Supporting Physical Standbys using Data Guard
• Immediate replacement for custom recovery solution
  • 10gR1 Data Guard with Maximum Performance (ARCH transport method)
• Data Guard is superior to our home grown solution
  • Switchover in minutes
  • Changes are pushed as they are generated
  • Data Guard handles redo shipping and gap recovery
  • Additional push/apply methods are available based on your requirements
  • Some monitoring and recovery agents still required
  • Works across any environment that supports Oracle
52
Amazon.com – Improving Availability using Fast-Start Failover in 10gR2
• Fast-Start Failover meets Amazon’s availability requirements
  • Now possible to do very fast failovers without any data loss
  • Fast-Start Failover will not commence if target standby is not synchronized with the primary
  • Synchronization status shown through fs_failover_status column in v$database
  • Since Fast-Start Failover is based on Maximum Availability, primary database not impacted upon standby failures
  • Old primary automatically reinstated as new target standby
53
Amazon.com – Data Guard Wish List
• Increasing capacity with readable standbys
  • Physical standbys that could support Read-Only operations during apply would become part of the application architecture and not simply part of the infrastructure
• Increasing options with support for low bandwidth WANs
  • The ability to support standbys in remote locations connected by low bandwidth networks would extend the reach of the infrastructure and remove distance as a barrier
54
Amazon.com – Summary Assessment
• Data Guard has fundamentally changed the way we look at and manage standby databases
• Fast-Start Failover has the potential to increase availability by an order of magnitude
• If the improvements in standby management provided in 10g are any indication, we can’t wait to get our hands on 11g
55
Data Guard Sessions from Oracle Development at Oracle OpenWorld
1. Monday 4:45pm - Session S281211 - Oracle Data Guard Customer Experiences: Practical Lessons, Tips, and Techniques, Moscone South 102
2. Tuesday 10:45am - Session 281207: Next Generation Oracle Database Availability: A Sneak Preview, Moscone West 2009-2011
3. Tuesday 1:45pm - Session 281212: Oracle Data Guard: Defining the Next Era in Data Availability and Data Protection, Moscone South 305
4. Wednesday 11:30am - Session 281210: Oracle Data Guard Tips and Tricks: Direct from Oracle Development, Moscone South 104
5. Thursday 12:30pm - Session 281208: MAA Best Practices: Building a Highly Available and Disaster-Proof Architecture, Using Data Guard, Oracle RAC, Automatic Storage Management, and Flashback, Moscone South 102
6. Thursday 2:00pm - Session 281209: MAA Best Practices: Reducing Downtime for Planned Maintenance Operations Using Oracle Database 10g, Moscone South 304
56
For More Information
Search for “Data Guard” at http://search.oracle.com
or
http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html
&
http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
57