
Emergency Handling for Recovery of SAP System Landscapes

Best Practice for Solution Management

Version Date: May 2008. The newest version of this Best Practice can always be obtained through the SAP Solution Manager.

Table of contents

1 Introduction
1.1 Goal of Document
1.2 What Is a Disaster?
1.3 Course of a Recovery
1.4 Organization
1.5 Dos and Don'ts in Case of a Disaster
2 Flowchart for Emergency Handling
3 From Incident to Disaster (Steps 1 to 5)
4 Error Categorization (Step 6)
4.1 Technical Failure
4.2 Logical Error
4.3 Cross-System Inconsistency
5 Activating Alternate Procedures (Steps 7 to 8)
5.1 Switchover after Technical Failures
5.2 Workarounds after Logical Errors and Data Inconsistencies
6 Preparations for Recovery (Step 9)
7 Executing Recovery (Step 10)
7.1 Overview of Recovery Phases
7.2 Technical Recovery (Recovery-Phase 1)
7.3 Data Repair (Recovery-Phase 2)
7.4 Business Recovery (Recovery-Phase 3)
7.5 Data Re-entry (Recovery-Phase 4)
8 Returning to Normal Operation (Steps 11 to 16)
9 Examples
9.1 Example 1: Media Failure
9.2 Example 2: Media Failure and Database Recovery Failure
9.3 Example 3: Lost Data
9.3.1 Example 3a: All Data Can Be Recovered
9.3.2 Example 3b: Remaining Data Loss Is Only Locally Relevant
9.3.3 Example 3c: Remaining Data Loss Causes Cross-System Inconsistencies
9.4 Example 4: Database Block Corruptions
Appendix
A - Flowcharts for Printout


1 Introduction

1.1 Goal of Document

Disruptions of core business functions are critical to the success of a company. When business operations are disrupted, a standardized procedure can help to return to regular operations in a timely manner. Meanwhile, activating technical switchover solutions or business-level workarounds can provide an interim solution to keep up operations (at least at some minimum level) while regular functionality is being restored.

Using flowcharts, this document outlines a general procedure to be followed in case of serious business disruptions. Starting with the escalation of an incident to a disaster, the main phases and steps that are part of the recovery procedure are depicted, allowing the classification of incidents and providing details on the recovery options available in the different phases.

The purpose of this document is twofold:

1. Support the handling of an acute emergency

2. Provide input for Business Continuity Planning

Emergency Situation

Following this document, SAP support employees and customer support organizations will be able to take a structured approach to support the recovery of a customer's system environment within a reasonable timeframe. This document provides information for each phase of a recovery procedure. Examples of some typical error situations and recovery approaches make the course of the recovery process clear.

The procedure outlined here can also serve as the basis for an action plan to be set up for coordinating an emergency situation.

For a customer, this document will be helpful if there is no disaster recovery plan available; if there is one, this document can support the emergency handling process by supplementing the customer-specific plan.

As such, this document is intended for:

- A disaster recovery team executing a recovery
- Support employees assisting customers in a business-down situation
- Duty managers / escalation managers accompanying a recovery

Business Continuity Planning

Business continuity planning has the task of preparing a company for a disaster situation by creating detailed recovery plans for different contingencies. The general flow of an emergency procedure described here can be adopted in a customer-specific recovery plan. The error categorization and recovery options listed in this document can provide input for the detailed recovery instructions to be defined.

As such, this document is intended for:

- The business continuity project manager
- Members of a business continuity project

Business Continuity Planning is also part of another Best Practice provided by SAP, "Business Continuity Management for SAP System Landscapes", which covers the project steps of establishing a business continuity concept; see http://service.sap.com/solutionmanagerbp.

The recovery procedures created by a business continuity project are intended to guide emergency handling in a business-down situation.


1.2 What Is a Disaster?

Within the scope of this document, a disaster is any event that seriously disrupts business operation beyond the acceptable outage time. An incident that cannot be resolved within a predefined time limit needs to be escalated to a disaster that requires further recovery procedures. This perception of the term 'disaster' is often used in the context of Business Continuity Management and goes beyond the more restrictive notion of a 'physical disaster' like a fire, flooding or an explosion.

Business disruptions or 'disasters' can be caused by technical failure or logical failure.

Technical failures of a system component usually affect all business processes that are using the affected component(s). They can range from crashes of individual hardware components, through database block corruptions, to building fires or the flooding of an entire computer center.

Logical failures, on the other hand, often affect only single or few business processes while the systems are still up. They range from partial data loss or data corruption inside a single system to inconsistencies of data being exchanged between multiple systems of an environment.

1.3 Course of a Recovery

Disaster recovery handling, as described in this document, starts with the escalation of an incident that is seriously disrupting operations to a disaster.

It is important to identify the type of error causing the disruption as early as possible, since the required recovery phases and applicable activities mainly depend on the error type.

When a disaster is declared, the following main phases of recovery may be applied in this order:

A. Activate possible alternatives to stay in business. This can be a technical switchover to a standby system, or the activation of alternate business processing using workarounds or emergency plans.

Which options are possible or applicable depends on the solutions in place and the actual type of error. Activating a workaround will be easier and faster if the workaround is already documented in a recovery plan.

B. Prepare the systems for the recovery.

C. If a system or component is down, system recovery or technical recovery, as a first step, has to reestablish technical availability of the system by fixing any technical error causing the disruption. This can be done, for example, by exchanging a defective hardware component, by activating a standby system or by restoring a database from a backup.

D. If all components are up (or were recovered in the previous step), logical errors inside each system have to be removed to restore the integrity of each system in itself. This requires in-depth application knowledge and is a prerequisite for the next step.

E. If data consistency between systems of the environment was affected, fixing it again requires in-depth application knowledge and time.

F. If data was lost and could not be recovered so far, the next effort should aim at re-entering such data into the systems.

G. Having finished all recovery phases, the systems and business functionality should be checked as a prerequisite for resuming regular operations.

More details on the different phases can be found in section 2 and following.


1.4 Organization

The organization that executes the recovery plans consists of:

- The support desk (incident management) staff that report a possible disaster case
- The business continuity manager
- The recovery team, with representatives from business (key business user / business process champion) and IT (application management / SAP technical operations / business process operations)
- A crisis team of senior managers from business (business process champion) and IT (application management) that need to be consulted for critical decisions like the activation of a disaster recovery plan
- Key users familiar with emergency workarounds for critical business functionality

A standardized method is recommended to support a complex business continuity approach. This approach involves all parties that can contribute to the resolution of a disaster.

The focus of the approach should be the analysis and resolution of the top issues, which have the highest impact on the operations of productive solutions in case of a disaster.

The benefits of a consolidated approach are:

- A standardized approach that has proven to be most effective and efficient
- Fast access to all experts needed
- Close collaboration and communication with all involved parties
- High transparency on the current issue status
- Continuous reporting on the progress, up to management level

1.5 Dos and Don'ts in Case of a Disaster

This section lists some general pitfalls during a recovery.

Don't: Apply a point-in-time recovery of a single system.
Instead: Repair the inconsistencies / logical errors inside the single system.

Don't: Apply a point-in-time recovery for the whole system landscape.
Instead: Identify and correct the inconsistencies / missing data.

Don't: Chase temporary differences.
Instead: Make sure that it is a real inconsistency, not a temporary difference.


2 Flowchart for Emergency Handling

This section describes the general flow of activities for handling a situation that impacts the continuity of business operations. The following figure provides a flowchart that can also serve as the basis for a specific action plan in an emergency situation.

Note: If available, a customer's business continuity plan and corresponding recovery plans need to be factored in when performing a recovery.

Figure 1: Flowchart 1: "Emergency Handling"

The different steps of the emergency handling process are discussed in more detail in the following sections:

Step   Section   Title
1-5    3         From Incident to Disaster
6      4         Error Categorization
7-8    5         Activating Alternate Procedures
9      6         Preparations for Recovery
10     7         Executing Recovery
11-16  8         Returning to Normal Operation


3 From Incident to Disaster (Steps 1 to 5)

Incident

A business disruption is usually detected by end users, who trigger an incident at the support organization. Incident management (support desk), as the primary addressee for any type of business disruption, analyzes and tries to resolve the error by answering the following questions:

- What has happened?
- Which processes are affected?
- How many users are affected?
- Is the error reproducible?
- Is the error business critical?

Involved Organization: Support desk

Escalation

The situation has to be escalated to business continuity management:

- If error resolution is not successful within a given time frame defined in the SLAs, or
- As soon as it becomes clear that the error is of high criticality and complexity and has a serious impact on business operations

Business continuity management is responsible for further handling such major incidents. Before escalating to a disaster, further analysis has to answer the following questions:

- What is the impact on business operations?
- Is the incident endangering the business of the customer (production down)?
- What is the root cause of the problem?
- What is the estimated time required for recovery?
- Should I invoke the disaster recovery plan (escalate)?

If a serious business disruption (one that prevents critical core business functions from operating) is determined, a disaster situation will be declared and the business continuity plan will be invoked.

Involved Organization: Support desk, BC Manager, Senior Management

Disaster 

In a disaster situation, the following questions need to be addressed by the disaster recovery team:

  Who do I have to call first?

  Is the incident of an isolated nature or will the consequences deteriorate over time (for example data inconsistencies that may spread if work continues in the system)?

  Is it possible to maintain partial functionality in the system or must the system be taken out of operation completely?

  Which workarounds are available; which are possible? See section 5.

  What needs to be done before starting recovery actions? See section 6.

  Which recovery options are available; which are possible? See section 7.

  Who else will be required in the recovery team (BC team)?

Involved Organization: BC Team


4 Error Categorization (Step 6)

The type of error (error category) determines not only the entry point for recovery execution (see step 10), but also the options for activating alternate procedures (business continuity solutions, see steps 7-8). Therefore, the first requirement now is to determine the type of error leading to the disruption.

We can distinguish:

- Technical failures
- Logical errors inside a system
- Cross-system inconsistencies affecting data exchanged between systems of a landscape

Involved Organization: BC Team

4.1 Technical Failure

A technical failure is usually caused by a hardware or system software fault. The system is usually unavailable, and thus all users and business processes relying on that system are affected.

Technical errors can be of the following nature:

- System / subsystem crash
- Infrastructure failure (network, power, telephony, …)
- Failure of a service provider
- Physical disaster (fire, flooding, …)

Error causes

Technical failures can, for example, result from:

- Hardware failure (memory, CPU, controller, …)
- Storage media or storage system failure
- Software bugs (firmware, operating system, filesystem, database, …)
- Database block corruptions

Block corruptions

Database block corruptions are a special kind of technical error. The content of storage blocks used by the database is corrupted, so the data stored in these blocks can no longer be used. The impact of block corruptions can range from SQL statements or transactions failing when accessing the corrupt block, up to crashes of the complete database instance. If system data of the database management system was corrupted, it is possible that the database instance can no longer be restarted.

The reasons for block corruptions are manifold – hardware failures like a defective disk controller, memory errors, or low-level software bugs messing up the data.

Although block corruptions are sometimes regarded as logical errors (since the data stored in the blocks is logically corrupt), we categorize them as technical failures because the phases that are applicable to recover from block corruptions start at the hardware / technology level (see section 7.2). Fixing a corruption on the technical layer may sometimes result in missing or incorrect data (a logical error).


4.2 Logical Error 

A logical error usually affects only parts of a system or its data, and thus only a few business processes and a limited number of users. Since all data is consistent from a database and 'SAP Basis' viewpoint, the systems are up and 'only' some business functionality is disrupted or faulty.

Logical errors can be of the following nature:

- Some business data is lost, ranging from complete tables to single table rows or single fields of table rows. If data is lost, the application context will be corrupted, since related data still exists in other tables
- Business data is falsified, ranging from single table rows to the contents of specific table fields
- Reports or other software processes are inoperable

Error causes

Logical errors can result from software errors or human errors (business user or administrator errors), such as:

- Data deletion or table drops on SQL, database administration or SAP level
- Transport-induced errors (wrong destination, wrong transport buffer, …)
- Faulty customizing
- Introduction and execution of bad code
- Incorrect usage of an application component, incorrect data entry
- Incorrect data transfer / incorrect data processing through interfaces

4.3 Cross-System Inconsistency

In a system landscape where business processes use and modify data in various systems, data consistency is vital for correct business operation. A business object that is exchanged between two systems, and should thus be available in both systems, is inconsistent (between the two systems) if:

- The object does not exist in one of the systems
- The two instances of the same object have different values in the two systems

A special type of inconsistency in this respect is an inconsistency between an IT system and the real world.

Difference or inconsistency

When talking about inconsistencies, it is important to differentiate between differences and inconsistencies. A difference is a mismatch between data that will always occur in connected running systems (due to the processing times of asynchronous update tasks, IDocs, BDocs and other interfaces, or different scheduling frequencies between systems). An inconsistency is a mismatch that does not disappear when all system activities are processed successfully. Before attempting a correction, it is therefore necessary to investigate whether an inconsistency or only a temporary difference is observed.

Error causes

Inconsistencies between two (or more) systems can be caused by:

- Software errors
  - No clear leading system
  - Program bugs
  - Non-transactional interfaces, for example, synchronous communication used for data manipulations
  - Incorrect error handling
- User errors / manual intervention
  - Incorrect data entry
  - Incorrect error handling
  - Deletion of queues
  - Direct access to data
- Messages in error states
- Simplified commit protocol (as used between APO and liveCache)
- Data loss in one of the systems
  - Incomplete recovery of a system
  - Technical disaster recovery (data replication) method that does not adhere to data consistency
- Tolerated data loss, for example, with asynchronous replication
- Missing consistency technology
- System failure or failover
  - Non-transactional interfaces with non-SAP components may be affected by data loss

5 Activating Alternate Procedures (Steps 7 to 8)

Depending on the type of error, different possibilities may be available to continue operations during recovery:

- Technical failure: A technical continuity solution may allow you to switch operations over to alternate hardware.
- Logical errors or inconsistencies: Workaround processes may be available to continue the most critical business functions.

5.1 Switchover after Technical Failures

Involved Organization: BC Team, IT

Server-side failure

Failover on the server side using a cluster solution for the database or the SAP central services can be done without limitations.

Storage-side failure

If data is replicated to a second facility, a switchover to the alternate storage system must always be considered carefully. The decision must be made by the disaster recovery team after weighing the benefits against the impact. Since a switchover may come along with some data loss, the need for business recovery to remove cross-system data inconsistencies may arise.

The amount of data loss during switchover depends on the implemented replication technology.

- A standby database may incur a relatively high data loss if the most recent logfiles from production cannot be applied.
- Asynchronous replication generally incurs data loss according to the allowed replication lag.
- Even with synchronous replication, some data loss may occur if the primary location continued to operate while the replication was already interrupted (a "rolling disaster").

Block Corruptions

If a standby database is available and complete recovery of this standby database can be done using the most recent logs from the production system, switchover to the standby database can be a very quick solution to enable continuity of business operation, because block corruptions caused by a technical failure usually do not transfer into a standby database.

If complete recovery is not possible on the standby database, a more detailed analysis should be conducted into other possibilities of resolving the corrupted blocks (see 7.2), because a switchover would result in cross-system inconsistencies whose resolution might be considerably more complex.

Note: Switching to a standby database will not solve the problem if the corruption was also transferred to the standby database, for example, if the block corruption was caused by bugs in the database software.

5.2 Workarounds after Logical Errors and Data Inconsistencies

Involved Organization: BC Team, Key users

Alternate Processing

If a business process becomes unusable due to logical errors or inconsistent data, the business process can only be re-activated after the error has been sufficiently resolved. In the meantime, it might be possible to "stay in business" using a workaround procedure.

Ideally, such workarounds are already documented in a business continuity plan and can be activated according to this plan. If this is not the case, it should be analyzed whether any such workarounds are possible and applicable to continue operations on a reduced scale.

The following types of workarounds may be considered:

- Manual, paper-based processing
- Operation based on the remaining systems of a system landscape
- Working with reduced functionality
- A combination of the above

Since a workaround always implies some limitations and usually requires more or less expensive post-processing when normal operations are reestablished, the activation of a workaround should be under the control of the disaster recovery team.

6 Preparations for Recovery (Step 9)

Before starting the actual recovery process, some preparations may be required to avoid unintended side-effects. Depending on the actual situation, the affected system may need to be shut down or isolated from other systems of the landscape before error resolution can continue. Isolation from other systems may be required, for example, to prevent the exchange of messages before data inconsistencies have been resolved.

Consider the following preparations before starting with the recovery:

- Notify users
- Stop user access to the production system(s)
- Isolate the affected system
- Salvage possibly helpful information
- Ensure you are able to revert the system to the point before you start the recovery

Involved Organization: BC Team, IT

Notify Users

Business users need to know about the disruption and must be given guidelines on how to proceed. If a workaround is activated, the users must be instructed to use it.

Stop User Access

While recovery actions are performed, normal users must not be allowed to work with the affected system or the affected business processes. This is mainly important during the resolution of logical errors or data inconsistencies. It can be achieved by disabling user logon to the system or by locking the affected transactions. After a technical system recovery, user logon should be prevented until it has been verified that recovery has really completed and that no further recovery on the logical level is required.

Possible actions:

- Lock users
- Lock the affected transactions
- Lock the system (see the sketch below)
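If the whole system has to be locked, this can also be done at the operating-system level with the transport control program tp. The following is a minimal sketch; the SID PRD and the transport profile path are assumed examples and must be adapted to your environment:

    # Lock the system so that only SAP* and DDIC can still log on (assumed SID: PRD)
    tp locksys PRD pf=/usr/sap/trans/bin/TP_DOMAIN_PRD.PFL

    # After recovery, re-open the system for normal logons
    tp unlocksys PRD pf=/usr/sap/trans/bin/TP_DOMAIN_PRD.PFL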

Isolate Affected System

As long as the state of the affected system is not completely clear, the system should be isolated from its environment. Any automatic actions should be disabled; message exchange with other systems of the environment should be avoided, especially when expecting cross-system inconsistencies after data loss or an incomplete database recovery.

Possible actions:

- Disable communication from other systems
  - In the other systems: disable connections, deregister outbound queues/destinations, disable automatic data requests
  - In the affected system: lock the RFC user for incoming messages from other systems
- Disable communication to other systems
  - In the other systems: lock the RFC user for incoming messages from the affected system
  - In the affected system: disable connections, deregister outbound queues/destinations, disable automatic data requests
- Disable transports
- Disable print-outs

Salvage Possibly Helpful Information

Prevent data that may be helpful for recovery or analysis on the application level from being deleted (for example, by automatic reorganization jobs). This mainly comprises information on messages that were exchanged between the affected system and other systems of the landscape.

- Do not delete the contents of message queues unless you are sure that this information is available in the target system (also see section 7.4)
- Unschedule message reorganization in XI
- Unschedule BDoc reorganization in CRM
- Avoid deletion of ALE data


Enable Reverting to the State Before Recovery

If recovery is not successful or even enlarges the damage, it should be possible to revert to the point before recovery started. This can be ensured by taking a backup before starting recovery, or by noting the exact point in time when recovery was started so that a database restore and log recovery would allow reverting to that point. If available, other technologies like savepoints or storage-based snapshots can provide even better solutions for this demand.
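On Oracle, for example, a guaranteed restore point is one way to implement this. The following is a minimal sketch, assuming an Oracle database (10g or later) with a configured flash recovery area; the restore point name is freely chosen:

    -- Before starting recovery: mark the current state
    CREATE RESTORE POINT before_recovery GUARANTEE FLASHBACK DATABASE;

    -- If recovery fails or enlarges the damage: revert to the marked state
    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT;
    FLASHBACK DATABASE TO RESTORE POINT before_recovery;
    ALTER DATABASE OPEN RESETLOGS;

    -- After a successful recovery: release the space held by the restore point
    DROP RESTORE POINT before_recovery;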

7 Executing Recovery (Step 10)

At this stage, the actual recovery from the failure is performed, using previously documented recovery procedures from a business continuity plan, if available. The available options depend on the type of error and lead to the flowchart in figure 2.

7.1 Overview of Recovery Phases

The recovery procedure can be divided into four different phases:

1. Technical recovery: fix technical errors to get the system up and running
2. Data repair: fix logical errors inside the affected system
3. Business recovery: fix cross-system inconsistencies
4. Data re-entry: re-enter data that might have been lost

The following flowchart depicts the sequence of recovery actions, coming from step 10 of flowchart 1 (figure 1). The entry point into this flowchart is determined by the error type (see section 4). Which phases actually need to be executed depends on the type of error and the resulting outcome of a previous phase. In section 9, you can find examples showing different paths for traversing the flowchart.


Figure 2: Flowchart 1.1: “Recovery Phases”

7.2 Technical Recovery (Recovery-Phase 1)

Goal: Repair defective hardware or software components to get the database and SAP system up and running

Involved Organization: BC Team, IT

Strategy for Technical Recovery

Systems or components may not be operational due to hardware failures, corrupted filesystems, lost system files, lost or corrupted database files, misconfiguration, corrupted software or software bugs, where 'software' can be any kind of low-level system software, operating system, database or application software. These failures can be the consequence of defective hardware but may have various other causes.

Coming from phase 1 of flowchart 1.1 (figure 2), the following flowchart depicts the main steps to get the systems back online.


Figure 3: Flowchart 1.1.1: "Technical Recovery"

Options for Technical Recovery

Although error and root cause analysis may not be easy, the measures to fix technical failures are quite straightforward and can require any of the following hardware-, system- or database-related activities:

- Fix hardware
  - Exchange defective hardware components
  - Switch to an alternate data center (in case of physical disasters)
- Fix software and filesystems
  - Check and repair filesystems
  - Restore filesystems (see below)
  - Restore or reinstall the affected software, which can be system software, drivers, operating system, database software, application software, and so on
  - Install error-free software patches
  - Fix configuration errors
- Fix database
  - Restore the database or database files (see below)
  - Resolve database block corruptions (see below)

Next step: If the failure or the recovery procedure did not cause any loss or falsification of database or application data, recovery can finish at this stage and proceed with step 11, returning to normal operation. Otherwise, the further applicable branches of the flowchart need to be traversed.


Filesystem Restore

A restore of storage volumes containing non-database-managed filesystems, with any kind of software components or any other kind of data, is always incomplete: any changes made after the last backup are lost. Unlike databases, filesystems have no concept of logfiles that would allow changes made after the backup to be reapplied.

Since an SAP system does not store application data in filesystems (with very few exceptions for special kinds of data like logfiles, TREX indexes or external content in CM), no impact on application data consistency is to be expected after a filesystem restore. The loss of information may thus affect software, configuration files, transport files or these special kinds of data. Subsequently, an analysis should be conducted to find out what exactly was lost, and then any activities that allow reconstructing the previous state should be repeated (repeat the installation of software patches, repeat configuration changes, repeat the export of transports). Strictly speaking, this is already an activity that falls under section 7.5 (Data Re-entry), but for simplicity we leave it here.

Database Restore and Recovery

A database restore from a backup may be required in case of:

- Media (disk) failures
  Since nowadays all productive installations implement some form of RAID protection, the failure of a single disk no longer requires a restore. Only if more than one disk of a single RAID group fails at the same time is a restore of the data residing on that RAID group inevitable. Due to the striping that is implemented for performance reasons, a restore may affect multiple tablespaces residing on that RAID group, or even the complete database.
- Block corruptions (see below)
- Deletion of data files or misconfiguration of raw devices, for example, due to an administrator fault

A restore always consists of three phases:

1. Restore of a database backup. Depending on the backup strategy, the restore can be done from different sources (tape, virtual tape, local disk, remote disk, standby database), yielding different restore performance.
2. Restore of the database logfiles from the backup medium.
3. Application of the database logfiles to the restored database (log recovery).

Log recovery is a very important step to roll the database forward and to apply the changes that were made after the backup. During log recovery, all archived logfiles should be applied, followed by the current online logs that were being written when the failure occurred. The goal should always be to perform a complete recovery including the latest committed transaction. Only a complete recovery avoids data loss, which is important to maintain data consistency in a system landscape, and only by applying all available logfiles (archived and online logs) can a complete recovery be achieved.
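As an illustration, a complete recovery on an Oracle database (after the datafiles have been restored, assuming an intact current control file) could look like the following SQL*Plus sketch; SAP installations would typically perform the same steps with BRRECOVER:

    -- Mount the restored database
    STARTUP MOUNT;

    -- Apply all archived logfiles without prompting, then the online logs
    SET AUTORECOVERY ON;
    RECOVER DATABASE;

    -- Complete recovery: open without RESETLOGS, no committed transaction is lost
    ALTER DATABASE OPEN;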

Note: Regardless of the status of aborted messages being exchanged between the systems, a complete recovery maintains data consistency in an SAP system landscape because of the transactional concept deployed for message exchange. With a complete recovery, all committed message states are restored exactly as they were at the time of the failure. The asynchronous messaging protocol used for data exchange in SAP environments (tRFC, qRFC, EO or EOIO messaging) then ensures that message exchange can be continued without losing or duplicating messages.

An incomplete recovery (point-in-time recovery) of the database needs to be avoided, since it introduces the need to remove cross-system inconsistencies caused by the data loss in the affected system (see below).

Example: For an example of a complete recovery after a technical failure, see section 9.1.

Next step: Following a complete database recovery, recovery can finish at this stage and proceed with step 11, returning to normal operation.


Incomplete Database Recovery / Data Loss

In very rare situations, an incomplete database recovery (point-in-time recovery) may be inevitable following a database restore (see the sketch after this list). A complete database recovery is generally not possible if:

- A required logfile is corrupt and no error-free copy of the logfile is available
- The tapes storing the logfiles are destroyed
- The most current online logs cannot be accessed and applied to the database, and no other replica of these online logs is available
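If such a point-in-time recovery is truly unavoidable, it could look like the following Oracle sketch; the timestamp is an assumed example marking the last point up to which logs can still be applied:

    STARTUP MOUNT;

    -- Apply logs only up to the last usable point; all later changes are lost
    RECOVER DATABASE UNTIL TIME '2008-05-14:06:30:00';
    ALTER DATABASE OPEN RESETLOGS;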

After an incomplete recovery, the database itself is in a consistent, error-free state and the affected system can be started (unless other errors are still present). However, the data loss has an impact on cross-system data consistency in a system landscape, and as a next step, the business recovery phase needs to address these issues.

Example: For an example of an incomplete recovery, see section 9.2.

Next step: Following an incomplete database recovery, data consistency between the systems needs to be re-established and, subsequently, the lost data needs to be re-entered into the system. Therefore, recovery phases 3 and 4 are required.

Block corruptions

If you recognize block corruptions in your database, always check your hardware, because block corruptions are mostly caused by layers below the database management system. To determine the real extent of the damage, also check your entire database, as sketched below.
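On Oracle, for example, the following checks can be used for this purpose; the datafile path and block size are assumed examples:

    # Offline check of a single datafile with DBVERIFY
    dbv FILE=/oracle/PRD/sapdata1/sr3.data1 BLOCKSIZE=8192

    # Online check of the complete database with RMAN;
    # corrupt blocks are then listed in V$DATABASE_BLOCK_CORRUPTION
    RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;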

The actions that need to be taken as a consequence of block corruptions depend on the affected database areas. The analysis should thus clearly identify the objects that the corrupted blocks belong to.

For Oracle, SAP note 365481 describes different options for proceeding depending on the type of object. Additional options are available as follows; note that some of them require expert knowledge!

- Restore and recover single corrupt blocks from an error-free backup with Oracle RMAN (see the sketch below). This is possible even if RMAN was not used to perform the backup, and this recovery can be done online.
- Restore and recover the database from an error-free backup
- Additional options of SAP Support:
  - Rebuild from redundant data (from another table, from the indexes, from another system, and so on)
  - Workaround with partial data loss (transformation of the technical error into a logical error)
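As a sketch of the first option, single corrupt blocks can be repaired online with RMAN; the datafile and block numbers are assumed examples taken from the corruption analysis:

    RMAN> CONNECT TARGET /

    # Restore and recover only the damaged blocks from a valid backup
    RMAN> BLOCKRECOVER DATAFILE 7 BLOCK 1234, 1235;

    # From Oracle 11g onwards, the equivalent command is:
    # RMAN> RECOVER DATAFILE 7 BLOCK 1234, 1235;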

Example: For an example of handling database block corruptions, see section 9.4.

Next step: Depending on the result and the method used to recover from block corruptions, recovery may finish at this stage or may require traversing the additional recovery phases 2, 3 or 4.

7.3 Data Repair (Recovery-Phase 2)

Goal: Remove logical errors or inconsistent data inside a single system

Involved Organization: BC Team, Business, IT


Do not Perform Point-in-Time Recovery

Until now, a commonly accepted method to remove logical errors from a system was to restore and recover the database to a point before the error occurred (database point-in-time recovery). The data loss that came along with this approach was accepted in favor of the ease of reestablishing the logical correctness of the system.

But nowadays, with business processes spanning federated system landscapes, this method is no longer appropriate in most cases!

While point-in-time recovery removes the logical errors, it introduces a new problem: cross-system data inconsistencies (see error category 3 described in section 4.3). So instead of having to repair logical errors inside a single system, cross-system inconsistencies have to be dealt with, which, in most cases, is an even more challenging task (see 'Business Recovery' in section 7.4) than repairing the logical error directly inside the affected system.

The following comparison contrasts 'data repair' with 'point-in-time recovery plus business recovery':

Data repair of logical errors inside a single system:
- Required knowledge: experts from the affected application
- Outage: outage of the affected business processes
- Duration of outage: time required to fix the error

Database point-in-time recovery followed by repair of cross-system inconsistencies:
- Required knowledge: database administrators, plus experts from all application areas that exchange data with the affected system
- Outage: outage of the complete system and many cross-system processes
- Duration of outage: time required for database restore and recovery, plus time required to fix the cross-system inconsistencies

In situations where a point-in-time recovery was the easiest solution in the past, more investment into logical error resolution is advisable with federated system landscapes (see 'Options for Data Repair').

Strategy for Dealing with Logical Errors

To avoid a point-in-time recovery of the production system, logical errors should be carefully analyzed and the possible options to repair the errors should be evaluated. If the effort turns out to be very high, it should be compared against the effort and implications imposed by a point-in-time recovery (including all follow-up activities).

If a point-in-time recovery was nonetheless identified as the lesser evil, data consistency between the systems needs to be re-established subsequently and the lost data needs to be re-entered into the system. Thus, recovery phases 3 and 4 will be required.

The following flowchart depicts the steps for recovering from logical errors, coming from phase 2 of flowchart 1.1 (figure 2).


Figure 4: Flowchart 1.1.2: "Data Repair"

Options for Data Repair 

The following general approaches are available to fix logical errors and are described in more detail below:

- Reverse engineering
- Recovery of lost data
- Check tools
- Doing nothing

Reverse Engineering

A typical method to resolve logical errors is reverse engineering, which means reverting the error step by step with the help of the experts (application, development, and so on). Reverse engineering can be supported by an analysis system that allows you to track back to the state when the error was not yet in the system.

Such an analysis system could be provided by different means:

- Perform a point-in-time recovery onto alternate hardware to the state before the error occurred (not a restore onto production)
- Roll forward a standby database that has a sufficient delay to production to the point before the error occurred
- Use the state of a (recently copied) test system to compare with the corrupted production system

As described in SAP note 434645, there are various possibilities that may be applicable to repair logical errors, for instance if:

- Data was corrupted by a malicious report
  - Develop a report to fix the data
  - Provide an analysis system and reconstruct the original data from there
- An index is corrupted
  - Recreate the index
- Wrong transports were imported into the system
  - Create and apply correcting transports
  - In case wrong table data was transported, reconstruct the former table contents following the options above (see table deletion)
  - Reconstruct the former ABAP sources

Recovery of Lost Data

If data was accidentally deleted (by deletion of a database table, a table drop, or deletion of table rows or attributes by a malicious report or human error), there may be several options for getting this data back without restoring the production system itself (as also listed in SAP note 434645):

- Provide an analysis system (as described above) and reconstruct the original data or database table from there
- Reconstruct the original data or database table from a standby database that is rolled forward to the point before the error occurred
- Oracle: flash back the table to an SCN, if undo information is still available (see the sketch below)
- Reconstruct the table from redundant data in other tables
- Reconstruct the table from redundant data in other systems
- Do without the data (for example, performance data of table MONI)
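For the Oracle flashback option, the following is a minimal sketch; the table name is an assumed example, flashing back a table requires row movement to be enabled, and both variants rely on the undo information still being available:

    -- Rewind the complete table content to a known-good SCN
    ALTER TABLE z_orders ENABLE ROW MOVEMENT;
    FLASHBACK TABLE z_orders TO SCN 7215481;

    -- Alternative: re-insert only the rows deleted within the last 45 minutes
    INSERT INTO z_orders
      SELECT * FROM z_orders AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '45' MINUTE)
      MINUS
      SELECT * FROM z_orders;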

Example: For an example of recovering a deleted table from an analysis system, see section 9.3.

Check Tools for Specific Applications

SAP offers several tools and reports to check and repair the consistency of business data used by different applications. Checks are, for example, available for:

- Documents in SD and LE
- Inconsistencies in MM
- Inconsistencies between MM and FI
- Processes involving WM
- Processes involving PP
- Processes involving PS

For more information, see the Best Practice "Data Consistency Monitoring within SAP Logistics", which will be available at http://service.sap.com/solutionmanagerbp.

Doing Nothing

In some special situations, logical errors may not require any further action, for example if the affected data is not vital for business operations:

- If non-critical data (like logfiles or monitoring data) was deleted, there is no need for recovery
- If non-critical data was corrupted, it can simply be deleted

Further Steps

At the end of the data repair phase, the data of the affected system should be correct from a business process point of view.


However, in some situations, repairing a logical error may not recover all lost or corrupted information completely. If data repair came along with some data loss, further analysis will have to show:

- Whether the lost data has an impact on data consistency between the systems – in this case, data repair has to be followed by a phase of business recovery
- How important the lost data is and how it can be re-entered into the system – in a subsequent phase of data re-entry

Next step: Depending on the outcome and success of data repair, further recovery may require traversing the additional recovery phases 3 or 4.

7.4 Business Recovery (Recovery-Phase 3)

Goal: Remove inconsistencies between the systems of a system landscape

Strategy for Dealing with Cross-System Inconsistencies

The task of this phase is to deal with inconsistencies that occur between systems of the system landscape. Similarly, inconsistencies between the systems and the real world need to be handled in this phase as well.

When dealing with cross-system inconsistencies, we need to know:

- Which systems are affected
- Which business processes are affected
- Which data objects are affected
- What the impact on each business process is

The tasks for removing such data inconsistencies include:

- Identifying inconsistencies by comparing possibly affected objects between possibly affected systems
- Filtering out temporary differences that do not constitute real inconsistencies
- Determining a strategy to fix the identified inconsistencies

Involved Organization: BC Team, Business

The following flowchart depicts the steps during business recovery, coming from phase 3 of flowchart 1.1 (figure 2).


Figure 5: Flowchart 1.1.3: “Business Recovery”

Options for Business Recovery

The following general approaches are available to remove inconsistencies between systems and are described in more detail below:

- Application- or object-level options: addressing inconsistencies by comparing and fixing business objects
- Initial load: addressing inconsistencies by retransferring inconsistent data from a leading system
- Message-based approaches: addressing inconsistencies by repeating the message transfer

The most suitable approach must be determined for each business object and may thus be different for each type of inconsistency.

Dealing with Pending Messages

The handling of non-processed messages (pending messages) contained in the message queues of the involved systems plays an important role during business recovery because:

- They have an influence on the differentiation of inconsistencies from differences
- On the one hand, they may contain data that can be salvaged
- On the other hand, they may contain data that might lead to duplicates or logical errors if they were processed

Depending on the state of each of the systems involved in business recovery, it must be determined how each of these message queues has to be handled. The options are:

- Delete pending messages because the related data objects will be handled completely by the compare and resynchronization process
- Delete pending messages because they contain data that is already available in the other systems
  For example, after an incomplete recovery, the outbound queues of the recovered system may be deleted because that data is already available in the connected systems (unless these queues were stopped and the messages were thus not processed).
- Process pending messages because they contain important information that can be rebuilt that way
  For example, after an incomplete recovery:
  - The inbound queues of the recovered system may contain valuable data that may need to be processed to recover that data.
  - The outbound messages in all other systems should be processed because they contain data that is not yet available in the recovered system. To preserve the correct order of these messages, however, it might be required to postpone their processing until all data objects have been compared and fixed.

Application- / Object-level Options

SAP offers a number of tools and reports to compare data objects between different systems. Many of these tools also allow fixing (deleting or re-transferring) inconsistent objects. The following tools are currently available from SAP:

- CRM: Data Integrity Manager (DIMa) to check and correct business objects between CRM and ERP, as well as between CRM and the CDB (consolidated database for mobile applications)
- CRM: data exchange toolbox to check and correct one-order documents (SAP note 718322)
- SCM: tools to check the internal and external consistency of business objects between APO and liveCache, respectively APO and ERP

For more information, see the documentation for each of these tools and the overview in the Best Practice "Data Consistency Monitoring within SAP Logistics", which will be available at http://service.sap.com/solutionmanagerbp.

If business objects are affected by inconsistencies for which no SAP tools are available, the following options may be evaluated:

- Compare objects with customer-developed tools
- Check for the availability of not officially released SAP "developer tools"
- Compare and fix objects manually
- Identify possibly affected objects by:
  - Evaluating the creation or change date (see the sketch below)
  - Comparing mapping tables in both systems
  - Analyzing logfiles that provide hints whether data was exchanged in the period in question (for example, logfiles written by the Communication Station about data exchanged with CRM Mobile Clients)
  - Analyzing information about exchanged documents (for example, the BDoc message store in CRM)
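For the first identification option, the following sketch selects possibly affected objects by their creation or change date directly on the database; it assumes ERP sales orders in table VBAK, and the schema name and the start date of the critical period are assumed examples:

    -- Sales orders created or changed since the assumed start of the critical period
    SELECT vbeln, erdat, aedat
      FROM sapsr3.vbak
     WHERE erdat >= '20080510'
        OR aedat >= '20080510';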

Example: For an example of executing application-level business recovery as a consequence of data loss in a system of a federated landscape, see section 9.2.

Initial Load

By reloading inconsistent objects from a leading system, inconsistencies can be removed in one go. Initial load may thus be an option for specific business objects, for example, master data, if this does not result in the loss of important additional attributes being maintained in the target system.

Message-based Approaches

Apart from comparing and correcting inconsistent business objects with object-specific methods, some cases may allow you to identify and correct inconsistencies by analyzing the message transfer that took place between the systems. A prerequisite is that the messages that were transferred during the period in question are still available in the systems or system logs. The idea is to fix inconsistencies that are caused by lost data by re-sending previously processed messages. Note, however, that the maintenance of number ranges and newly assigned numbers may represent an issue with this approach.

Whether information on messages that were exchanged is still available depends on the type of communication being used between the systems in question:

  ALE (IDoc): The “ALE Recovery Tool” (transactions BDRL, BDRC) allows you to analyze and resend messages. For more information, see: http://help.sap.com/saphelp_erp2005vp/helpdata/en/26/29d829213e11d2a5710060087832f8/frameset.htm

  RFC: For RFC communication, an analysis of past message transfer is not possible since no logs are kept in the systems.

  BDoc: For BDocs between CRM and ERP, information on the data exchange can be retrieved from the BDoc Message Store (transactions SMW01 and SMW02). For data transfer between CRM and mobile clients, the Mobile Client Log could be evaluated.

  Communication via SAP XI: SAP XI keeps track of all messages from and to the sending systems. In the case of EO or EOIO messaging, information on the messages can also be found in the sending and receiving systems. For RFCs that were routed via XI, information will only be available in XI but not in the sending or receiving systems. Currently there is no tool available to support the analysis or re-sending of messages.

  File interfaces: Data that was uploaded via file interfaces could be recreated by repeating the file upload, if the file is still available. Thus, data consistency to external applications could be reestablished.

Further Steps / Remaining Data Loss

If business recovery became necessary due to some data loss in a system of the landscape, business recovery using the above options might have been able to bring back some of the lost data by transferring it from another system that holds a second copy of the data. But usually, not all lost data can be recovered that way, for example:

  Data objects that were not exchanged with other systems

  Special attributes of data objects that do not exist in the other systems from which the object was replicated

Next step: If business recovery was not able to completely recover all lost data, further recovery will proceed with phase 4, trying to re-enter lost data into the system.

7.5 Data Re-entry (Recovery-Phase 4)

Goal: Get back lost data in a single system that could not be recovered by previous phases

Description

Incomplete database recovery in phase 1 might have caused the loss of data for a complete period of time; the resolution of database block corruptions might have caused a very isolated, partial loss of objects.

Data repair for logical errors in phase 2 might also have caused some loss of data objects or attributes.

Business recovery in phase 3 might have been able to get back some lost data from other systems of the environment, but in most cases not completely.

At this stage, the data available in the systems should be consistent within and between the systems. So what is left now is the task of re-entering any information into the system(s) that is still lost after (or due to) the previous recovery phases.

Involved Organization: BC Team, Business, Key users

Procedure

In general, it should be known quite exactly at this point which data is lost, or at least the period affected by the data loss can be narrowed down very well. Any such data should now be re-entered into the system as comprehensively as possible.

There are two options to approach the data re-entry phase:

  Key users get access to the system and re-enter lost data before the system is returned to regular operation

  Data re-entry is postponed and done by some key users or by the normal business users after handover to production.

The best time and method to re-enter lost data depends on the nature of the affected data, and both approaches may be taken, depending on the business objects concerned.

The following options may apply to get back lost information:

  Users enter the data from written notes or from memory

  Data is re-entered from an external input stream (batch input, file input, upload tools, transports, and so on)

  Data is recovered from a copy of the old, corrupt production system

  Data is recovered from a (recently copied) test system

Next step: Recovery has finished, and checks should verify that recovery was indeed successful and that the system is ready to return to productive operation.

8 Returning to Normal Operation (Steps 11 to 16)

Checks

When recovery and error resolution have finished, checks are needed to verify that the system has really reached an error-free state that allows returning to normal operations.

Checks should verify:

  Functional operability of business processes

  Correctness and consistency of business data

The approval of whether the system is ready for production is given based on the results of these checks. The decision should be taken by the business continuity manager, together with application management.

If recovery quality was not sufficient to return to production, recovery must be continued until a satisfactory state is reached.

In specific situations, it may be possible to “partially” hand over the system, which means that some business processes might be excluded for a limited time. This could be the case, for example, if the system in general reached a sound state that allows continuing business operations, with the exception of some business processes that need further fixing. A prerequisite for partial handover is a clear separation of the released and the non-released processes and their data.

Involved Organization: BC Team, Key users, Senior Management

Handover to Production

After handover to production, regular users are allowed to log on to the system and work with their usual functionality. All established workarounds will be called off.

Completing Data Integrity

After handover to production, some follow-up activities may still be required, for example to:

  Re-enter lost data into the system that was purposefully not yet covered during recovery phase 4 (section 7.5).

  Allow business users to identify and re-enter lost data that was not identified by the recovery team. To enable users to check on their data, they need to be informed in detail about the critical period of recovery.

  Integrate data, which was created while using the workaround processes, back into the regular system. Depending on the nature of the workaround or alternate process, such data may be available on paper or in other systems.

Involved Organization: BC Manager, Users

Leave Disaster Status

When users signal that all data has been recovered to the best of their knowledge, and when the remains of workarounds have been successfully integrated back into the productive system, this disaster case can be closed.

Lessons Learned

Having left the disaster status, follow-up activities should further investigate the root cause of the disaster with the goal of avoiding similar situations in the future. To learn for future emergencies, the complete disaster handling process should be reviewed to identify possible areas of improvement in the business continuity plan.

Involved Organization: BC Manager 

9 Examples

9.1 Example 1: Media Failure

Error Scenario:

The SAP system runs on an Oracle database located on a RAID-protected storage system.

Two disks out of the same RAID group fail. Multiple Oracle tablespaces are located on this RAID group. A backup containing the lost files and the complete changelogs (online and offline redologs) are available for restore and recovery.

Figure 6: Recovery Flow for Example 1

Recovery Phase 1:

Execute restore and complete DB recovery

The Oracle database can only be mounted, not opened, because datafiles are missing. By accessing the view v$recover_file in mount status, you can find out which datafiles need to be restored from a backup.

The latest backup taken before the crash that contains the missing datafiles is identified in the directory /oracle/<SID>/sapbackup. The missing datafiles are restored with SAP’s tool brrestore.

The backup logfile contains information about the redolog in use when the backup was taken. The database view v$log shows which redolog file is the current one.

All redologfiles that are no longer available on disk are restored from tape with SAP’s tool brrestore. After making all files available on disk, a recovery of the database is started with ‘recover database;’ in the Oracle tool sqlplus.
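A minimal SQL*Plus sketch of these steps, run as sysdba with the database in mount state; the brrestore calls in between are omitted, and file numbers and names differ per system:

    -- Identify the datafiles that need restore and recovery
    SELECT file#, error FROM v$recover_file;

    -- Map the file numbers to datafile names for brrestore
    SELECT file#, name FROM v$datafile
     WHERE file# IN (SELECT file# FROM v$recover_file);

    -- After restoring datafiles and missing redologs, check the current redolog
    SELECT group#, sequence#, status FROM v$log;

    -- Apply all redologs up to the latest committed transaction, then open
    RECOVER DATABASE;
    ALTER DATABASE OPEN;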

Details of this procedure can be found in note 4161.

As a rough estimation of the required restore time for datafiles and redologs, we can assume that it is approximately the same as the backup runtime. During log recovery, approximately 50-500 MB of redolog volume can be recovered per minute (for example, 60 GB of redolog volume at 200 MB per minute would take roughly five hours). This depends on the hardware and can be estimated much better after applying the first redologs.

Recovery Phase 2:

There are no logical errors in the application data of this system, and recovery phase 2 is not needed.

Recovery Phase 3:

There is no data loss. Since the database was recovered completely (including the latest committed transaction), all messages that were being exchanged between the systems when the crash occurred reflect exactly the state at that point in time. All messages are restartable and can be processed as before. Recovery phase 3 is not needed.

Recovery Phase 4:

No data was lost; recovery phase 4 is not needed.

9.2 Example 2: Media Failure and Database Recovery Failure

Error Scenario:

This example assumes the same error scenario as example 1. However, in this case, complete recovery is not possible. During database log recovery, it is recognized that one logfile needed for recovery is defective and cannot be applied to the database. Therefore, recovery needs to be aborted. Analysis of the timestamps of the logfiles shows that the recovered state of the database lies 2 hours before the time when the database crashed. This means that 2 hours of business data is lost and cannot be recovered by technical means.

Figure 7: Recovery Flow for Example 2

Recovery Phase 1:

Database recovery ended with an incomplete recovery of the database. The database and the SAP system can be started, but 2 hours of data is lost. Nonetheless, the system is in a consistent state (as it was 2 hours before the media failure).

Recovery Phase 2:

There are no logical errors in the application data of this system and recovery phase 2 is not needed.

Recovery Phase 3:

The data loss has an impact on data consistency between the systems of the landscape. Therefore, business recovery is required.

Example business scenario: The ERP system is the leading system for material master records. The major part of the material master data is created or changed in ERP and subsequently loaded to the CRM system. However, inside CRM the users also create competitive materials to track similar products of competitors for statistics about lost sales opportunities.

The CRM system had an incomplete recovery and now has a state that is two hours older than the ERP system, implying that some material updates are now missing.

As part of the recovery, most material masters can be recreated by a repeated load from ERP to CRM. For instance, you could use report RSSCD100 to evaluate all change documents for ERP materials, creating a list of materials for a corrective load (see the sketch below). First, check whether recently changed materials are still in their messaging phase (that is, in the ERP outbound queue, the CRM inbound queue, or as a CRM inbound BDoc in validation error) to identify temporary differences. With the remaining list of materials, you can create a so-called request download definition in CRM Middleware to extract materials from ERP and load them into CRM again.
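To illustrate the change-document evaluation, a hedged SQL sketch against the standard change-document header table CDHDR; the object class 'MATERIAL' is the standard class for material masters, while the date and the two-hour window are hypothetical values for this example:

    -- Materials changed during the lost two-hour window, per ERP change documents
    SELECT DISTINCT objectid AS matnr
      FROM cdhdr
     WHERE objectclas = 'MATERIAL'
       AND udate = '20080512'                    -- day of the failure (hypothetical)
       AND utime BETWEEN '140000' AND '160000';  -- lost period (hypothetical)

The resulting material list is the input for the request download definition mentioned above.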

Recovery Phase 4:

Some data is still lost and now needs to be re-entered manually.

The competitive materials were not available in the ERP system and therefore could not be recreated automatically. You would have to manually recreate the lost competitive materials.

Checks:

Afterwards, it is recommended to run a full comparison of material masters between ERP and CRM with the CRM DIMa tool (Data Integrity Manager).

9.3 Example 3: Lost Data

Error Scenario:

This example presents the handling of a logical error. We assume that, due to a user fault, the data of a complete application table was deleted.

Imagine the CRM table CRMM_TERRITORY was dropped by a user error. Now, the complete CRM Territory Management application becomes unusable.

Recovery Phase 1:

This phase is not applicable. Since the error is categorized as a “logical error”, recovery starts with phase 2.

To demonstrate that logical errors can require very different measures during the course of recovery, the example is now separated into cases 3a, 3b, and 3c.

9.3.1 Example 3a: All Data Can be Recovered

Figure 8: Recovery Flow for Example 3a

Recovery Phase 2:

Recovery is done using the following procedure:

Figure 9: Handling of Logical Errors Using an Analysis System

Steps:

1. Block user access

2. Unload / export new data (if applicable)

3. Restore database to an “analysis system”

4. Recover analysis system close to the error 

5. Unload / export data from analysis system

6. Insert data into production (repair)

7. Merge rescued data with repaired data

Attention: Details depend on many factors (like error type, affected objects, and so on), so careful analysis and application knowledge is required.

Result of recovery phase 2 in example 3a: Recovery of data in table CRMM_TERRITORY from the analysis system is completely successful; no data was lost. Thus, recovery phases 3 and 4 are not required.

9.3.2 Example 3b: Remaining Data Loss is Only Locally Relevant

Recovery Phase 2:

Recovery of table data could not bring back the table completely. Further analysis of the data loss shows that this does not impact objects or attributes that are exchanged with other systems. Thus, recovery phase 3 is not required, but phase 4 needs to be applied to re-enter the lost information.

As in the example above, the CRM table CRMM_TERRITORY was dropped by a user error, and the Territory Management application becomes unusable on the CRM system. However, a complete recovery failed and only a part of the table was restored to its original state.

Figure 10: Recovery Flow for Example 3b

Recovery Phase 3:

Not required for example 3b.

Recovery Phase 4:

Now, the remaining missing table entries of CRMM_TERRITORY need to be recreated manually. The complete list of key fields (territory GUIDs) is still available in related tables, for example in the territory structure table CRMM_TERRSTRUCT or in the territory validity table CRMM_TERRITORY_V (see the sketch below). With good knowledge of the data model, there is a chance to reconstruct the structure of the missing entries of the main table.
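A hedged SQL sketch of how the missing keys could be collected; the shared key column name TERRITORY_GUID is an assumption and must be verified against the actual table definitions in the ABAP Dictionary:

    -- Territory GUIDs referenced by the structure table but missing in the main table
    SELECT DISTINCT s.territory_guid
      FROM crmm_terrstruct s
     WHERE NOT EXISTS (SELECT 1
                         FROM crmm_territory t
                        WHERE t.territory_guid = s.territory_guid);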

9.3.3 Example 3c: Remaining Data Loss Causes Cross-system Inconsistencies

Recovery Phase 2:

Recovery of table data could not bring back the table completely. Further analysis of the data loss shows that the lost objects are relevant for data exchange with other systems. Thus, recovery phase 3 (business recovery) is required to reestablish data consistency between the systems.

We take again the example above with the loss of CRM table CRMM_TERRITORY. Again, as in example 3b, a complete recovery failed and only a part of the table was restored to its original state. However, the situation gets more complex now, as in the new scenario we not only have a Territory Management application on the CRM server, but also on the CRM Mobile Clients (laptop application). We thus need to reestablish data consistency between the CRM Server and the CRM Mobile Clients.

Figure 11: Recovery Flow for Example 3c

Recovery Phase 3:

As a first important step, we need to identify the affected systems. In our example, the connected CRM Mobile Clients get a periodic update of the territory structure by a regular background run of program CRM_TERRMAN_DOWNLOAD. This updates the CDB (consolidated database for the mobile scenario) and sends delta messages to the Mobile Clients.

As an emergency step, it is advisable to deactivate this delta update job, to prevent uncontrolled distribution of incomplete table entries to the Mobile Clients.

On the other hand, the CDB database can be a source for reconstructing missing entries of the CRM table. It has all entries up to the last delta update in a comparable data model (table SMOTERR and others). Therefore, the keys and attributes of the territories can be collected there and serve as a source for manual recreation of the missing territories. For a larger number of missing records, it is also feasible to develop a small ABAP program for this task (a sketch of the underlying selection follows below).
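A minimal sketch of such a reconstruction, written as plain SQL for brevity; in practice this would be a small ABAP report, as suggested above, and the key column TERR_GUID as well as the attribute columns are assumptions that must be verified against the real data model:

    -- Hypothetical repair: copy territories that still exist in the CDB image
    -- but are missing in the CRM main table (all column names assumed)
    INSERT INTO crmm_territory (terr_guid, valid_from, valid_to)
    SELECT c.terr_guid, c.valid_from, c.valid_to
      FROM smoterr c
     WHERE NOT EXISTS (SELECT 1
                         FROM crmm_territory t
                        WHERE t.terr_guid = c.terr_guid);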

Recovery Phase 4:

As a last step, we need to identify which territories were not recreated. Records newly created after the last delta update run have to be recreated manually without any reference system.

9.4 Example 4: Database Block Corruptions

Error Scenario:

During normal system operation, corrupt table blocks in an Oracle database are discovered.

A full consistency check on all database files is triggered (as described in SAP note 23345), returning no further corruptions.

No regular consistency check was done in the past, so it cannot be guaranteed that there is a backup available containing an old, non-corrupted version of the blocks. A consistency check now performed on the oldest backup available shows the same corruptions. Restore and recovery of single corrupt datablocks or datafiles is therefore not an option.

The hardware partner of the customer and the hardware partner’s SAP Competence Center are involved to check the hardware.

Figure 12: Recovery Flow for Example 4

Recovery Phase 1:

Trying to access data in non-corrupt blocks fails once a corrupt block has been read. To restore access at least to the non-corrupt data, a new version of the table needs to be created that contains all data from the non-corrupt blocks, but no corrupt blocks. In general, this can be achieved by reading ‘around’ the corrupt blocks and copying everything from the corrupt table to a “clean” table (see the sketch below). After renaming, the “clean” table will finally become the original table and contain the readable data from the corrupt table minus the data from the corrupt blocks. The new table is now corruption-free, but some data was lost.
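For illustration only, a hedged sketch of how such a read-around copy can be approximated with Oracle's standard DBMS_REPAIR package; the procedure actually used in this example is SAP's internal Clean Copy tool (see below). Schema and table names are placeholders, and the corrupt blocks are assumed to have been marked beforehand with DBMS_REPAIR.CHECK_OBJECT / FIX_CORRUPT_BLOCKS:

    -- Let full table scans skip blocks that are marked software-corrupt
    BEGIN
      DBMS_REPAIR.SKIP_CORRUPT_BLOCKS(
        schema_name => 'SAPR3',          -- placeholder schema
        object_name => 'CORRUPT_TAB',    -- placeholder table
        object_type => DBMS_REPAIR.TABLE_OBJECT,
        flags       => DBMS_REPAIR.SKIP_FLAG);
    END;
    /

    -- Copy the readable rows into a clean table; rename it back afterwards
    CREATE TABLE "SAPR3"."CORRUPT_TAB_CLEAN" AS
      SELECT * FROM "SAPR3"."CORRUPT_TAB";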

This example shows that solving a technical issue can imply the need for corrections in further recovery phases.

Block corruptions are different from other technical failures because the actions in the HW- and DB-related phase may depend on the possible actions in a later recovery phase. Therefore, it is necessary to involve application experts already in the first phase, to decide how each corrupt table should be dealt with. Depending on the specific table, other options than the above procedure of copying the table and losing the corrupt rows may be possible:

- If all columns of the table are contained in at least one index, retrieve the data from the indexes without reading the table blocks.

- If the table contains redundant data from another table, create a new, empty version of the table and refill it with application reports (for example, for tables BSIS, BSAS, and so on) or on the fly, during normal system operation (for example, for ABAP load or ABAP Dynpro tables).

- Recreate the table empty, if it contains data that may be less important and will not cause any harm by being deleted. Candidates for this category are tables containing log information, statistics, and so on (for example, MONI).

In the above scenario, we assumed that no special handling for the table was possible. As much data as possible needs to be copied to the clean table, and as much information as possible has to be gathered about the lost rows (for example, their primary keys), so that in later recovery phases the logical inconsistencies can be repaired efficiently by application experts.

While preparing for the copy of the corrupt table, the system is still in use. Only the business processes that access exactly the corrupted data in the corrupt table cannot be executed.

Copying the data to the clean table requires system downtime, because concurrent updates to non-corrupt areas of the table need to be avoided during the copy. The duration of the downtime can be estimated by a test-copy (the copied data is deleted afterwards) while the system is still in use. The duration of the copy largely depends on the hardware and on unforeseen problems caused by the corruption. In the worst case, the copy gets stuck and Oracle support has to be involved. Therefore, SAP insists on performing a test-copy, as always in such cases. After the test run, an appropriate downtime is planned as soon as possible.

For copying the data, creating the indexes on the clean table, and gathering the information about the lost rows, the SAP internal tool Clean Copy for Oracle (note 796399) is used.

Once the physical recovery of the table is done, the application experts can continue with the next recovery phase. They will get the information on how many table rows were lost, together with a current list of the key values of the lost rows. They may now decide that further downtime is required during the next recovery phases, or that it is possible to go live with some restrictions after finishing phase 1 and to execute the following recovery phases while the system is already released for production.

Recovery Phases 2, 3 and 4:

Depending on the achievable quality of error resolution resulting from phase 1, it may be required to proceed with further recovery phases. In comparison to example 3, cases similar to 3a), 3b) and 3c) might be considered. This possible flow of error handling is indicated as ‘possible’ in figure 12. Since the handling of these errors is generally done in the same way as described in example 3, we do not repeat it here.

Appendix

A - Flowcharts for Printout

This appendix repeats the flowcharts provided in this document for print-out.

Flowchart 1: “Emergency Handling”

Flowchart 1.1: “Recovery Phases”

Flowchart 1.1.1: “Technical Recovery”

Flowchart 1.1.2: “Data Repair”

Flowchart 1.1.3: “Business Recovery”