Planning For Catastrophe with IBM WAS and IBM BPM

© 2015 IBM Corporation

Planning for Catastrophe withIBM WebSphere Application Server &IBM Business Process Manager

Tom Alcott STSM

Chris Richardson STSM

This Session

• This session will focus on the architectural and operational issues that need tobe considered when planning and implementing a Disaster Recovery plan withWebSphere Application Server and IBM BPM. Topics will include use ofmultiple data centers, geographic separation constraints, supporting softwarecomponents, disaster recovery and other common deployment issues. Thoughfocused primarily on WebSphere Application Server and IBM BPM this sessionalso applies to IBM Middleware that is deployed on WebSphere ApplicationServer as well as Pure Application System.

• While not a prerequisite, attendees should be familiar with the materialcovered in " Preparing to Fail, Practical WebSphere Application Server HighAvailability”

Introduction

• Why Are We Here?

• To Avoid This

Agenda

• Concepts

• Disaster Recovery• Multiple Cells and Data Centers

• WebSphere Application Server Recovery

• IBM BPM Recovery

• Final Thoughts

Definitions

• Redundancy• The provision of additional or duplicate systems, equipment, etc., that

function in case an operating part or system fails, as in a spacecraft.

• Isolated• Separated from other persons or things; alone; solitary

• Independent• Not dependent; not depending or contingent upon something else for

existence, operation, etc.

• All of the Above are Fundamental for Effective High Availability andDisaster Recovery

Definitions

High Availability (HA)• Ensuring that the system can continue to process work within one

location after routine single component failures

• Usually we assume a single failure

• Usually the goal is very brief disruptions for only some users forunplanned events

Continuous Operations• Ensuring that the system is never unavailable during planned

activities

• E.g., if the application is upgraded to a new version, we do it in away that avoids all downtime

Continuous Availability (CA)

• High Availability coupled with Continuous Operations

• No tolerance for planned downtime

• Little unplanned downtime as possible

• Very expensive

• Note that while achieving CA almost always requires anaggressive DR plan, they are not the same thing

Definitions

Background: High Availability in one picture

7

IP Sprayer

Node Agt

Node1

Server 1Cluster “A”

Web Server

User Registry

DMgr

IHS

IHS

Server 1

Server 2

Node Agt

Node2

Server 2

Server 2

SharedFilesystem

WAS Txn Logs

Database

Storage (SAN)

Server 2

Cluster “B”

Cluster “C”

• Clustered IP Sprayer and Firewalls (notdepicted)

• Clustered HTTP Servers

• WAS-ND Cell with Clustered ApplicationServers

• User Registry (LDAP) Hardware Clustered withShared Disk

• Database Hardware Clustered With SharedDisk

• JMS Provider (Not Depicted)• WAS Messaging Engine with Shared Disks/DB• External JMS with Hardware Cluster and

Shared Disk

• Transaction Logs on Shared File System

• Clusters of “2” Provide High Availability,

• Don’t Forget “Rule of 3”,

• When Using With Clusters of 2

• An Outage (Planned or Unplanned) ReducesCapacity by 50%

• Is No Longer Fault Tolerant

Disaster Recovery (DR)

• Ensuring that the system can be reconstituted and/or activated at another location andcan process work after an unexpected catastrophic failure at one location

• Often multiple single failures (which are normally handled by high availability techniques)is considered catastrophic

• There may or may not be significant downtime as part of a disaster recovery

• This environment may be substantially smaller than the entire production environment, asonly a subset of production applications demand DR

• Normally based on justifiable business need.

• Recovery Time Objective (RTO)

• Service Recovery with little to no interruption

• Recovery Point Objective (RPO)

• Data Recovery and acceptable data loss

Definitions

• Service Levels (SLAs ) cover many things, our focus is availability aspects

• You need a clear set of requirements that define precisely the availabilityrequirements of the system, taking into account

• Components of the system A system has many pieces and business aspects, how do their requirements differ?

‒ Responsiveness and throughput requirements 100% of requests aren't going to work perfectly 100% of the time

• Degraded services requirements Does everything have to meet the responsiveness requirements ALL the time?

• Dependent system requirements What are the implications if a system on which you depend is down?

• Data Loss• Application Data

• Application State (Is this Critical in a Disaster?)

• Maintenance Change occurs, how does that affect availability?

• Disaster Recovery The unimaginable happens, then what?

Definitions

HA Service Level Example

SLA ExternalCommitment

SLA Internal Target

Service Timeframe 7 x 24 7 x 24

ApplicationProcessing Availability

99.5% per month 99.7% per month

Recovery TimeObjective

4 Hours 1 Hour

Maintenance Window Tue-Thurs3:00 - 6:00 am

Tue-Thurs3:00 - 6:00 am

99.5% = 3.60 HoursDowntime/Month

99.7% = 2.16 HoursDowntime/Month

DR Service Level Example

SLA ExternalCommitment

SLA Internal Target

Recovery TimeObjective

16 Hours 4 Hours

Recovery PointObjective

~ 0 (No Data Loss) ~ 0 (No Data Loss)

Note: This is Recovery of an Entire Data Center with 100’s of Servers,Application, Database, Messaging, etc

Agenda

• Concepts




• Final Thoughts

Stage 0 DR – a sound HA strategy

• HA is cheaper and Less Complex Than DR .

• A Robust HA Solution prevents small failures frombecoming disasters

• Don’t let a (relatively) minor failure become acatastrophe

• Eliminate all single points of failure in your primarydatacenter

• Spread Workloads Across Multiple Servers (andHypervisors ! )

• Add 2nd Production WAS-ND Cell

• Consider DB Replication, DB2 HADR, Oracle RAC inconjunction with hardware clustering

• LDAP/Registry Replication

• Otherwise, an HA event could force you to enactyour DR procedure

• Database is only replicated HA in Different DataCenter

13

IP Sprayer

Node Agt

Node1

Server 1Cluster “A”

Web Server

User Registry

DMgr

IHS

IHS

Server 1

Server 2

Node Agt

Node2

Server 2

Server 2

SharedFilesystem

WAS Txn Logs

Database

Storage (SAN)

Server 2

Cluster “B”

Cluster “C”

Multiple Data Center Options (1/4)

• Classic DR

• Active/Passive

• Two Data Centers, one Serving Requests the other Idling

• Independent Cells

• Easier Than Active/Active

• User and Application State Synchronization are Less Critical

• Asynchronous Replication Is Likely Sufficient

• Lower Cost for Network and Hardware Capacity

• From a Capacity perspective One Data Center is Being Underutilized.

• Typically Does Not Incur S/W License Charges When Idle

• If You Don’t Pay for S/W Licenses Is Cost and Underutilization Still a Concern?

• WebSphere License Provides for

o Hot – Processing Requests License Required

o Warm – Started But Not Processing Requests, License Not Required

o Cold – Installed, But Not Started, License Not Required

• DB2 and MQ Require < 100 % of Hot Licenses for Replication


• “Active/Active” with Single Set of Active Databases

• Two Data Centers

• Independent Application Cells

• Serving Requests for Same Applications

• Database(s) Only Active in One Data Center

– Additional Latency for Application Data Requests from Remote Data Center

– Request Processing Interruption When Data Replica is Promoted to Primary


• Classic “Active/Active”

• Two Data Centers

• Independent Cells and Synchronized Resource Managers (DBs)

• Serving Requests for Same Applications

• Requires Shared Application Data

o Application Data Consistency is prerequisite to any other planning

o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication

• Additional Hardware and Disk Capacity Required

• e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition

• Expectation of Continuous Availability and Transparent Failover

o Requires Sharing Application State

• Expectation Seldom Realized

• Outage of One Data Center, Stops Disk Writes in Both, No Longer “Transparent”

• Synchronous Disk Replication Limits Geographic Separation

• Hardest and Costliest to Achieve

Note: Disk Replication only employed for Application Data and Application State, WAS-ND cell configuration, software

updates, and application maintenance should maintained independently in order to insure isolation (and availability)


• Hybrid “Active/Active” (Partitioned by Applications)

• Two Data Centers• Independent Cells with replicated Resource Managers

• Both DC’s Serving Requests, Both DC’s Configured for All Applications• Running Different Applications (With Different Application Data)

– New Application Tests

– One DC Performing Updates, One DC Performing Inquiry Only (e.g. datawarehouse)

• No Shared Application State, No Shared Application Data– Asynchronous Replication Sufficient

• Global Network Switch Used to Partition/Distribute Traffic

• In the Event of a Disaster• Users failover from one DC to the other

• Likely Some Interruption– As Data Replica is Promoted to Primary

– During Failover Workload Startup

• Provides Most of the Benefits of “Classic Active/Active” without the Cost andComplexity

The CAP Theorem

• In a distributed environment, especially spanning data centersacross LANs and WANs there are three core requirements for aservice:

• Consistency– Either the service works or fails– Traditional ACID of databases provides consistency and isolation

• Availability– Extremely important in web business model– In a large distributed system, one may have to compromise with

consistency for the sake of availability• Partition Tolerance

– Network partition will happen when not all machines are connected– “No set of failures less than the total network failure is allowed to

cause the system to respond incorrectly” – Seth and Lynch– Quorum is used to guard against split brain syndrome

• Brewer’s CAP conjecture states that• One can achieve only two not all three of the above mentioned

requirements

http://en.wikipedia.org/wiki/CAP_theorem

18

Multiple Active Data Centers and the CAP Theorem

• Active/Active requires you to sacrifice either consistency, availability orpartition tolerance.

• All three aren’t possible

• If you choose full availability, then you are going to lose guaranteedconsistency.

• So you need to design with this in mind, and build in mechanisms(typically involving queuing technologies) that enable your system to"tend towards“ consistency.

• Your data is going to be in two places, either partitioned or replicated.

• If the former, what happens when one site is down?

• If the latter, what happens when users hitting each site see slightlydifferent versions of the current state?

• These are very complex problems.

• Which is why I try to steer customers away from active/active and intoan active/passive model with DR from active to passive.

• But they always feel like they are wasting hardware……………!

19

Data Center Utilization Urban Legends

• Legend

• Active/Active Improves Utilization

• Reality

• An Active/Active Topology at 40-50% Utilization in Each DC IsEquivalent to An Active/Passive Datacenter Deployment with OneActive at 80% to 90 % Utilization and the Other Passive

• Running Active/Active at Greater Than 50% Of Total (bothDatacenters) Capacity Can Often Result in a Complete Loss of ServiceWhen a Data Center Outage Occurs

o Insufficient Capacity in Remaining Data Center to Handle > 100% CapacityResults in

• Poor Response Time (at best)

• Network and Server Overload, Resulting in a Complete Crash

Active/Active - What’s Wrong With This Picture ?

A former employer of mine had two data centers, running active/active

at two facilities approximately 2.6 miles (or 4.2 KM) apart.

• Close Proximity Addressed Data Consistency Concerns ……..But………

What Happens When ?

• There’s an earthquake

• There’s a Civil Insurrection

• A Hazardous Chemical Spill Occurs• And The Wind Is Blowing the Chemical Cloud from West to East (or vice versa)

• Your DC May Not Be Located in a Locale Prone to Earthquakes• But what about the other catastrophes ???• They can, *and* will happen !!

• There’s No Substitute For Isolation Between Data Centers

• Data Centers Should Be Sufficiently Distant So That a Single Event Doesn’tImpact Both !!

• This Likely Mandates Asynchronous Replication• Active/Active No Longer Practical

Network Latency and Application Data Consistency – A 3rd

Party Perspective

• Since the latency or round trip time for a network is usually correlated to thelength of the network, or the physical distance between the two end points (inthis case the primary and standby), Maximum Protection and MaximumAvailability modes are not recommended for Data Guard deployments over aWide Area Network (WAN). Note that this recommendation is driven by thelaws of physics (speed of light limitation) - the greater the distance of anetwork, the longer it will take for data packets to traverse the network, andhence the longer it will take for primary database transactions to commit.

• http://www.oracle.com/technology/deploy/availability/htdocs/dataguardnetwork.htm

Multiple Cells and Data Centers

• Your Network Team Assures You That Can (or Have) Constructed a Network LinkBetween Data Centers• For Arguments Sake, We’ll I Agree, It Is possible to construct a network so that latency is

NOT an issue Under Normal Conditions• Even so, WANs are Less Reliable than LANs.

o And Much Harder To Fix !

• But You’re Missing The Point !• Network Interdependency Between Data Centers Means That the Data Centers are Not

Independent

• Question• Do You Want to Have to Explain to Your CIO Why A Problem In One Data

Center Impacted The Other and Resulted in a Outage Because You Didn’t

Have Cells Aligned to Data Center Boundaries ?

How Do I Recover WebSphere Application Server ?

• File System or OS Backup and Recovery• Disk or Tape• WAS backUpConfig/restoreConfig

– WAS_PROFILE/properties, ../etc, WAS_ROOT/java/jre/lib/*properties,WAS_ROOT/java/jre/lib/security,

• Build from Scratcho Only a Realistic Option with Complete Set of Scripts and Rigorous Change Control

• Best Options• File System/OS Backup & Recovery• backupConfig/restoreConfig for Deployment Manager

– WAS V8.0 and above, addNode –asExistingNode can Reconfigure Each Note AfterrestoreConfig of Deployment Manager Configuration

• Both From Last Know Working Production Configuration• Otherwise No Assurance Recovery Will Succeed• Same Concern with “Build From Scratch”‒ If Using Virtualization Consider VM Cloning for Install and Configuration‒ Consider Smart Cloud Orchestrator for Automated Install and Configuration

o Provision Both Primary Site and DR Site in a Consistent Manner with SCOo Note: Don’t Deploy to Backup to DR Site over a WAN !

• May Need to Change Cell and Host Names• Will the Original Data Center Be Restored, Or Is it Gone (for Good)?

WAS Full Profile DR Recovery

• Transaction Recovery on Separate (Physical) Server

• Access the Transaction Logs

– Move/Mount the Transaction Logs to Physical Server Hosting Application Server withAccess to Same Resources (e.g. JDBC, JMS)

– V8.x Optional Use of DB for Transaction logs

• If Recovery Occurs in Different Cell use wsadmin to Configure the Same JAASAlias for Accessing XA Resources

– With adminconsole the node name gets prefixed to the alias.

• No Longer Required to Have Same Hostname and IP Address

– Different IP’s with Multiple Host Alias’s Typicalhttp://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjta_mvelog.html

• WAS Messaging Engine

• recoverMEConfig AdminTask Command Retrieves MEUUID From PersistentMessage Store and Updates Message Engine Configuration

• Allows Recovery of Stranded Messages After Catastrophic ME Failure in WASV8.5.0 and above

• http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rjk_recoverme_config.html

• Some Small Capacity in DR Site Needs to be set Aside for Recovery

• In Addition to Production Workload

• Or Production Workload Not Processed Until Recovery is Complete

WAS Liberty Profile DR Recovery

• Transaction Recovery on Separate (Physical) Server

• Access the Transaction Logs

– Move/Mount the Transaction Logs to Physical Server Hosting Application Server withAccess to Same Resources (e.g. JDBC, JMS)

– V8.x Optional Use of DB for Transaction logs (Not supported for Production)

• Restore server.xml Backup (e.g. zip)

• No Longer Required to Have Same Hostname and IP Address

– Different IP’s with Multiple Host Alias’s Typical

• WAS Messaging Engine

• Restore server.xml Backup (e.g zip)

• Point to Copy of Messages on File System (Could also employ zip/unzip)

• Some Small Capacity in DR Site Needs to be set Aside for Recovery

• In Addition to Production Workload

• Or Production Workload Not Processed Until Recovery is Complete

Classic DR for Stateful Applications: full cell replication

2828

IP Sprayer

Node Agt

Node1

Msg.mem1Messaging

Web ServerUser Registry

DMgr

AppTarget

Support

DMgr

IHS

IHS

App.mem1

Sup.mem1

Node Agt

Node2

App.mem2

Sup.mem2

Filesystem (NFS)

Node Agt

Node1

Msg.mem1

DMgr

IHS

IHS

App.mem1

Sup.mem1

Node Agt

Node2

App.mem2

Sup.mem2

WAS Txn Logs WAS Txn Logs

Consistency Group

SAN Replication

for Application Data

File Copy

for Install & Config Data

Database

Storage (SAN)

Primary Datacenter Secondary Datacenter

Msg.mem2 Msg.mem2A P A P

Filesystem (NFS)

User RegistryWeb Server

Database

Storage (SAN)

Consistency Group

IP Sprayer

Messaging

AppTarget

Support

DMgr

DR via Stray Nodes & Database Managed replication

2929

IP Sprayer

Node Agt

Node1

Messaging

Web Server

DMgr

AppTarget

Support

IHS

IHS

App.mem1

Sup.mem1

Node Agt

Node2

App.mem2

Sup.mem2

IP Sprayer

Node3

DMgr

IHS

IHS

App.mem3

Sup.mem3

Node4

App.mem4

Sup.mem4

WAS Txn Logs WAS Txn Logs

User Registry

Primary Datacenter Secondary Datacenter

Database Database

DB-managedReplication for

Application Data

Msg.mem1 Msg.mem2A P

Msg.mem3 Msg.mem4A P

Node AgtNode Agt

User Registry

Web Server

IBM BPM: Licensing Guidance for HA/DR Configurations

Configuration

What’s active?

What licenses are needed for the backup nodes?DB2

WASND

BPM

“Classic” Disaster Recovery(DR): SAN-based replication

off off off• Files in the backup data center are being synchronized automatically by aSAN. But there is no DB2, WAS ND, or BPM program active.• No extra DB2, WAS ND, or BPM licenses needed for backup nodes.

DR configuration with OSreplication for config data & DB2

replication for runtime dataON off off

• Active DB2 HADR Standby setup considered warm standby – licenses for100 DB2 PVUs required to cover warm standby servers• BPM and WAS ND are inactive – no extra WAS ND or BPM licensesneeded for backup nodes.

DR configuration with WAS NDreplication for config data &

DB2 replication for runtime dataON ON off

• Active DB2 HADR Standby setup considered warm standby – licenses for100 DB2 PVUs required to cover warm standby servers• WebSphere process in the backup nodes used for synchronization – WASND licenses are required.• BPM is inactive* – no extra BPM licenses needed for backup nodes.

High Availability (HA) ON ON ON• IBM BPM is active in all nodes – full BPM licenses are required.• WAS ND and DB2 licensing based on Supporting Programs terms of BPM

Note: For any HA or DR configuration, any node(s) running DMGR only will also require a WAS ND license.(*Inactive: BPM server JVMs in the remote datacenter are not started)

REFERENCES• IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backupnodes. However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program mustbe licensed. E.g., when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPMservices are not active, no BPM licenses are required in backup nodes.

• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from aClassic DR configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the serverrecovery time after a disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active.

•DeveloperWorks article on licensing DB2 10.1 servers in a HA environment

What Your Mother Didn’t Tell YouAbout Disaster Recovery

• Transaction Log replication & Network configuration requirements

• Historically, the WebSphere transaction service required IP Address & Hostname at the target server match thesource

• This is because, for some types of transactions, the server network information is written into the logsthemselves

• As of WAS v8 servers, this requirement is relaxed a bit – IP addresses no longer need to match

• High Availability and Disaster Recovery for the Deployment Manager

• Techniques are available that leverage hardware clustering or replication of the DMgr’s cell configuration to analternate server. These are described in detail at:

https://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html

• Beginning in version 8.5.5 WebSphere supports High Availability features for the Deployment Manager, using ashared filesystem. WAS installations (including BPM) running on applicable versions can leverage this feature:

http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/twve_xdsoconfig.html

• Because these HA techniques rely on DMgr replication, they apply unchanged to DR scenarios

– Generally, we recommend recovering application servers before bringing up a replacement DMgr

• A note about logical corruption – data integrity problems that get replicated to the DR environment

• We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state

• This can be done at the replica, to avoid interfering with normal operations

• If the Primary data and its replica are both corrupted, then state can be restored to a copy made before thecorruption

• For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? Thatway, if one datacenter is lost, another can carry the load

• See ‘Active/Active Antipattern’ discussion on the following slides

31

Active/Active anti-pattern (Cells Spanning Data Centers)

Secondary DatacenterPrimary Datacenter

32

IP Sprayer

Node Agt

Node1

Msg.mem1Messaging

Web Server

User Registry

DMgr

AppTarget

Support

IHS

IHS

App.mem1

Sup.mem1

Node Agt

Node2

App.mem2

Sup.mem2

Node Agt

Node3

DMgr

App.mem3

Sup.mem3

Node Agt

Node4

App.mem4

Sup.mem4

Msg.mem2 Msg.mem3 Msg.mem4A P P P

IP Sprayer

Web Server

IHS

IHS

Filesystem (NFS)

WAS Logs

Consistency Group Consistency Group

SAN Replication

for Application Data

Database

Storage (SAN) Storage (SAN)

User Registry

Filesystem (NFS)

Database

Why is this type of topology considered an Anti-Pattern?

• Active/Active approaches introduce new complexities that undermine the stability of the system

• Issues/Problems Can Propagate From One DC to the Other

• This Compromises Redundancy and Resiliency

• Worst Case a Outage Cascades Across Both Data Centers

• Frequently these negate the advantages that led the customer to consider the approach in the first place• Increased risk of network instability can lead to partitioned network (‘split brain’)

– Independent Transaction “Recovery” in Both Data Centers By HA Manager– The two data centers could move to inconsistent transactional states!!

• Increased network latency can limit system performance during normal operations– Latency between the Application Server and its databases– Latency among cluster members communicating via the WebSphere HA Manager

component• Desire to automate failover increases risk of false failover & rapid cycling• A system more than 50% utilized introduces the risk that losing a single component will

compromise the entire system, turning what could have been a (simple) HA event into a truedisaster

• In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all:• Attempts to limit latency lead to datacenters physically near each other, increasing the risk that

a single disaster will eliminate the entire system• Many disasters arise from human error and data corruption. Tight coupling between DR

resources does not provide protection from this type of failure at all• A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO

• Refer to

• http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d

• http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html

33

What are the recommended alternatives?

• Properly plan a High Availability solution distinct from Disaster Recovery

• Eliminate single points of failure through redundancy in network and software components• HA features allow rapid and automatic recovery from loss of a single component. Utilize them!

• Improve RTO by reducing complexity, scripting operational procedures and drill

• Automate Processes for Repeatability and Consistency

– Scripting

– Point and Click” is Not Repeatable

• Discipline and Practice are Essential

• Well Defined Procedures for Every Contingency

– You Do Not Want to Learn During an Outage

– Practice Those Procedures

– Won’t Make Mistakes in Crisis

– Validates that Procedures Actually Work

– Practice Backup and Recovery, System Failures, Disaster Recovery, etc.• Goal: Make Daily Operations Boring

• Improve electricity distribution via Uninterruptable Power Supply

• Utilize application design patterns like loose coupling in order to improve application flexibility

• In cases where RTO between 1 and 4 hours is necessary, without the requirement to process new work,consider the Stray Node pattern

34

Is This Different for the Liberty Profile and Liberty Collectives?

• No• Same Fundamentals for Effective Redundancy and the

Requirements for Isolation and Independence Apply

• Though Liberty May Make It Easier to Ignore or Believe Thatthe Fundamentals Don’t Apply

35

Agenda

• Concepts




• Final Thoughts

Disaster Recovery

• Develop a Disaster Recovery Plan

• Group Business Needs and Associated Applications into Tiers

• Group into tiers based on the hard/soft dollar impact on theorganization

• Categorize by RPO and RTO.

• The top tier likely includes zero data loss and either no downtime orperhaps just a few minutes of down time

• Subsequent tiers have an RTO of 24 hours, then 48 to 72 hours,then perhaps 72 to 96 hour

• Essential Part of Any Plan• Who approves DR move/recovery ?

• Automated site failover is a bad ideao Typically triggering DR is very expensive

o You do not want to trigger a DR by accident because of some transient issue – just makesthe situation worse

Disaster Recovery Objectives

• Recovery Time Objective

• How quickly the system will be able to accept traffic after the disaster

• Shorter times require progressively more expensive techniques

o e.g., a tape backup and restore is relatively inexpensive

o e.g., a fully redundant fully operational data center is very expensive

• One challenge is detection time

• It takes time to determine you are in a disaster state and triggerdisaster procedures

o While you are deciding if you are down, you are probably missing your SLA.

o Does the RTO include detection time?


• Recovery Point Objective

• How much data you are willing to lose when there is a disaster

• Limiting data loss raises costs

o e.g., restoring from tape is relatively inexpensive but you'll lose everythingsince the last backup

o e.g., asynchronous replication of data and system state requires significantnetwork bandwidth to prevent falling far behind

o e.g., synchronous replication to the backup data center guarantees no dataloss but requires VERY fast and reliable network and will significantly harmperformance

• Warning: results in increased latency which means capacity must be increased atall layers


• Most RTO and RPO goals will deeply impact application and infrastructurearchitecture and can't be done “after the fact”

• e.g., if data is shared across data centers, your database and applicationdesign will have to be careful to avoid conflicting database updates and/ortolerate them

• e.g., application upgrades have to account for multiple versions of theapplication running at once which can affect user interface design, databaselayout, etc

• Extreme RTO and RPO goals tend to conflict

• e.g., using synchronous disk replication of data gives you a zero RPO but thatmeans the second system can't be operational, which raises RTO

• Trying to Achieve a Zero RTO *and* a Zero RPO is Mutually Exclusive

Disaster Recovery Testing

• The DR hardware Should Be Put Into Actual Production Usage

• Otherwise How Can You Be Sure It Will Work When You REALLY Need It.

• A Corollary of Murphy’s Law

• The larger the numbers, the less likely all Tier 1 machines can be successfullyrestored to operations.

• DR Testing Options

• “Saturday Afternoon Surprise”

– Unannounced DR Test

– Only If You Can Tolerate an Outage

• Progressively More Realistic and Complex Tests

– Startup of Remote Infrastructure

– Remote Startup with Simulated Workload

– Remote Startup with Production Workload Shift

• Other Issues In a Real Disaster

• Will your key staff want to travel?

• Will they be able to travel?

41

Example and (Very High Level) DR Plan

• Executive/Management Approval for Activation of DR

• Isolate Data Centers

• Halt Incoming Network Traffic

– Static “Temporarily Unavailable Web Page”

• Break Disk Synchronization

• Sever Network Links Between Data Centers

• Start and Recovery of Surviving Center

• Restore/Recovery Hardware and Middleware

• Start DB, Messaging and Application Servers

• Examine DB and Message provider logs for pending transactions

• Recover Pending transactions and messages

• Start Accepting New Work in Surviving Data Center

• Enable Network

42

Other Aspects to Consider

• An HA, CA or DR Deployment Architecture is Not a Product Feature.

• WAS and the WebSphere portfolio products

– Provide HA Features and Function

– Can Be Employed in an HA Architecture

– The Appropriate Environment Varies by Customer

– One Size Does Not Fit All !

• Optimizing WebSphere HA Capabilities into a Robust Deployment

• Requires In-depth Understanding

– Of Environment

– Of Applications

– Of Operational Requirements (Service Levels)

• Architectural Advice May Require ISSW Assistance

43

Learn from Your Mistakes

• Mistakes and failures will occur, learn from them• What separates mediocre organizations from the good and great isn't so

much perfection as it is the constant striving to get better – to not repeatmistakes

• After every outage perform• Root cause analysis

– Capture diagnostic information

– Meet as a team including all key players to discuss

– Determine precisely what went wrong• Wrong doesn't mean “Bob made an error.”

• Find the process flaw that led to the problem

– Determine a corrective action that will prevent this from happening again• If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected

– Implement that corrective action• All too often this last step isn't done

• Verify that action corrected problem

• A senior manager must own this process

Indispensable When Planning for Catastrophe

• Think !

Questions?

Notices and DisclaimersCopyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced ortransmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract withIBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has beenreviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBMshall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY,EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OFTHIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFITOR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of theagreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal withoutnotice.

Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples arepresented as illustrations of how those customers have used IBM products and the results they may have achieved. Actualperformance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services does not imply that IBM intends to make such products,programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do notnecessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neitherintended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legalcounsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’sbusiness and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice orrepresent or warrant that its services or products will ensure that the customer is in compliance with any law.

Notices and Disclaimers (con’t)

Information concerning non-IBM products was obtained from the suppliers of those products, their publishedannouncements or other publicly available sources. IBM has not tested those products in connection with thispublication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBMproducts. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.IBM does not warrant the quality of any third-party products, or the ability of any such third-party products tointeroperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR APARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under anyIBM patents, copyrights, trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise DocumentManagement System™, Global Business Services ®, Global Technology Services ®, Information on Demand,ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™,PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®,urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks ofInternational Business Machines Corporation, registered in many jurisdictions worldwide. Other product andservice names might be trademarks of IBM or other companies. A current list of IBM trademarks is available onthe Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.

Thank YouYour Feedback is

Important!

Access the InterConnect 2015Conference CONNECT AttendeePortal to complete your sessionsurveys from your smartphone,

laptop or conference kiosk.

Backup Slides

51

Shameless Self Promotion

IBM WebSphere Deployment and Advanced ConfigurationBy Roland Barcia, Bill Hines, Tom Alcott and Keys BotzumISBN: 0131468626

52

Another Recommended Book

IBM WebSphere v5.0 System AdministrationBy Leigh Williamson, Lavena Chan,Roger Cundiff, Shawn Lauzon and ChristopherC. Mitchell

ISBN: 0131446045

Licensing Servers as Back Up Servers

From IBM Contracts and Practices Database• The policy is to Charge for HOT, and not for WARM or COLD back ups. The following are definitions of what constitutes HOT-

WARM-COLD backups:

• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.

• COLD - a copy of the program may be stored for backup purpose machine as long as the program has not been started.

• There is no charge for this copy.

• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing anywork of any kind.

• There is no charge for this copy.

• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, thisprogram must be ordered.

• There is a charge for this copy.

• "Doing Work", includes, for example, production, development, program maintenance, and testing. It also could include otheractivities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g. activelinking with another machine, program, data base or other resource, etc.) or any activity or configurability that would allow anactive hot-switch or other synchronized switch-over between programs, data bases, or other resources to occur

Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information

53

Software

Planning For Catastrophe with IBM WAS and IBM BPM