

© 2011 VMware Inc. All rights reserved

Confidential

Business Continuity & Disaster Recovery in Virtual & Cloud Environments

Liam Ferrel

Basic Outline for This Session

CAMPUS

METRO / SYNC

DISTANCE SYNC / ASYNC

Availability Design Myths

One solution fits all

One solution is cheaper

One is easier to implement

One is easier to manage: it’s just “Next > Next > Finish”

So will you and this PowerPoint tell me which solution I need?

• No

• All solutions have pros / cons

• All implementations are different

• All customer profiles are different

• Use this information to define what will work for you; you have to live and breathe it


Keep it Simple (Don’t take offence at the next 4 slides)

“Disaster” Avoidance – Host Level

“Hey… That host WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.”

This is vMotion.

Most important characteristics:

• By definition, avoidance, not recovery.

• “Non-disruptive” is massively different from “almost non-disruptive”

“Disaster” Recovery – Host Level

“Hey… That host WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the affected VMs on another host.”

This is VMware HA.

Most important characteristics:

• By definition, recovery (restart), not avoidance

• Simplicity, automation, sequencing

Disaster Avoidance – Site Level

“Hey… That site WILL need to go down for maintenance. Let’s vMotion to avoid a disaster and outage.”

This is Long Distance vMotion.

Most important characteristics:

• By definition, avoidance, not recovery.

• “Non-disruptive” is massively different from “almost non-disruptive”

Disaster Recovery – Site Level

“Hey… That site WENT down due to an unplanned failure, causing an unplanned outage due to that disaster. Let’s automate the RESTART of the affected VMs on another host.”

This is Disaster Recovery.

Most important characteristics:

• By definition, recovery (restart), not avoidance

• Simplicity, testing, split-brain behavior, automation, sequencing, IP address changes


Types

Type 1: “Stretched Single vSphere Cluster”

[Diagram: one vCenter Server; one vSphere Cluster containing Site A hosts (4 × ESXi) and Site B hosts (4 × ESXi); Site A Datastore and Site B Datastore on Active / Active Storage; vMotion between the sites]

One little note re: “Intra-Cluster” vMotion

Intra-cluster vMotions can be highly parallelized

• With vSphere 4.1 and vSphere 5, up to 4 per host / 128 per datastore if using 1GbE

• 8 per host / 128 per datastore if using 10GbE

Need to meet the vMotion network requirements

• 622 Mbps or more

• 5 ms RTT or less (raised to 10 ms RTT if using Metro vMotion with vSphere 5 Enterprise Plus)

• Layer 2 equivalence for VMkernel (support requirement)

• Layer 2 equivalence for VM network traffic (required)
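As a rough illustration of the concurrency limits above, the sketch below estimates how many back-to-back batches it would take to move a set of VMs. The function names and the even-spread assumption are hypothetical and only serve the arithmetic; the per-host and per-datastore figures are the ones quoted on the slide.

```python
import math

# Concurrent vMotion limits quoted on the slide (vSphere 4.1 / vSphere 5).
PER_HOST_LIMIT = {"1GbE": 4, "10GbE": 8}
PER_DATASTORE_LIMIT = 128


def max_concurrent_vmotions(hosts: int, datastores: int, nic: str) -> int:
    """Upper bound on simultaneous vMotions for a group of source hosts.

    Simplifying assumption (not from the slide): VMs are spread evenly,
    so the bound is the smaller of the host-side and datastore-side caps.
    """
    host_cap = hosts * PER_HOST_LIMIT[nic]
    datastore_cap = datastores * PER_DATASTORE_LIMIT
    return min(host_cap, datastore_cap)


def migration_waves(vm_count: int, hosts: int, datastores: int, nic: str) -> int:
    """Number of back-to-back batches needed to move vm_count VMs."""
    return math.ceil(vm_count / max_concurrent_vmotions(hosts, datastores, nic))
```

For example, `migration_waves(100, hosts=4, datastores=1, nic="1GbE")` works against a cap of 16 concurrent migrations, while the same call with `nic="10GbE"` works against a cap of 32.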

Type 2: “Multiple vSphere Clusters”

[Diagram: one vCenter Server; two vSphere Clusters, one with Site A hosts (4 × ESXi) and one with Site B hosts (4 × ESXi); Site A Datastore and Site B Datastore on Active / Active Storage; vMotion between the clusters]

One little note re: “Inter-Cluster” vMotion

Inter-cluster vMotions are serialized

• They involve additional calls into vCenter, so there is a hard limit

• VMs lose their cluster properties (HA restart priority, DRS settings, etc.)

Need to meet the vMotion network requirements

• 622 Mbps or more

• 5 ms RTT or less (raised to 10 ms RTT if using Metro vMotion with vSphere 5 Enterprise Plus)

• Layer 2 equivalence for VMkernel (support requirement)

• Layer 2 equivalence for VM network traffic (required)
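The network requirements listed above lend themselves to a simple pre-flight check. The following is a minimal sketch, assuming the link measurements are already available; the `InterSiteLink` type and `vmotion_link_problems` function are hypothetical names, not part of any VMware tool.

```python
from dataclasses import dataclass

MIN_BANDWIDTH_MBPS = 622   # "622 Mbps or more"
MAX_RTT_MS = 5             # standard vMotion
MAX_RTT_MS_METRO = 10      # Metro vMotion (vSphere 5 Enterprise Plus)


@dataclass
class InterSiteLink:
    bandwidth_mbps: float
    rtt_ms: float
    layer2_vmkernel: bool    # Layer 2 equivalence for VMkernel ports
    layer2_vm_network: bool  # Layer 2 equivalence for VM network traffic


def vmotion_link_problems(link: InterSiteLink, metro_vmotion: bool = False) -> list:
    """Return the unmet requirements; an empty list means the link passes."""
    problems = []
    rtt_limit = MAX_RTT_MS_METRO if metro_vmotion else MAX_RTT_MS
    if link.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        problems.append(f"bandwidth below {MIN_BANDWIDTH_MBPS} Mbps")
    if link.rtt_ms > rtt_limit:
        problems.append(f"RTT above {rtt_limit} ms")
    if not link.layer2_vmkernel:
        problems.append("no Layer 2 equivalence for VMkernel (support requirement)")
    if not link.layer2_vm_network:
        problems.append("no Layer 2 equivalence for VM network traffic (required)")
    return problems
```

A link with 8 ms RTT fails the standard check but passes when `metro_vmotion=True`, which mirrors the 5 ms / 10 ms distinction on the slide.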

Stretched Cluster Considerations

Most networks lack site awareness, so stretched clusters introduce new networking challenges.

With all storage configurations:

• The movement of VMs from one site to another doesn’t update the network

• VM movement causes “horseshoe routing” (LISP and other technologies help address this)

• You’ll need to use multiple isolation addresses in your VMware HA configuration
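To make the last point concrete, here is a minimal sketch that builds one isolation address per site as vSphere HA advanced options. The option names `das.isolationaddressN` and `das.usedefaultisolationaddress` are HA advanced settings; the helper function, the site names, and the IP addresses (documentation ranges) are assumptions for illustration, and disabling the default isolation address is a design choice rather than a requirement stated on the slide.

```python
def ha_isolation_options(site_addresses: dict) -> dict:
    """Build HA advanced options with one isolation address per site.

    site_addresses maps a site name to a pingable address at that site.
    """
    # Assumption: the default gateway is not used as an isolation address.
    options = {"das.usedefaultisolationaddress": "false"}
    for index, site in enumerate(sorted(site_addresses)):
        options[f"das.isolationaddress{index}"] = site_addresses[site]
    return options


# Hypothetical two-site example using documentation IP ranges.
options = ha_isolation_options({"site-a": "192.0.2.1", "site-b": "198.51.100.1"})
```

With an address at each site, a host can still reach at least one isolation address when only the inter-site link fails, which is the situation a stretched cluster has to plan for.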

Type 3: “Site to Site Replication & Recovery”

[Diagram: Site A with vCenter Server A, an SRM Server, vSphere Cluster A, Site A hosts (4 × ESXi), and Site A Datastore; Site B with vCenter Server B, an SRM Server, vSphere Cluster B, Site B hosts (4 × ESXi), and Site B Datastore; array-based (sync, async or continuous) replication or vSphere Replication (async) between the sites]

Protection Groups

Collection of VMs that are protected together

• Grouping is enforced by storage layout (LUN or consistency group) for array-based replication (ABR); it is per VM for vSphere Replication

[Diagram: three VMFS LUNs; Datastore Groups; Protection Groups]
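The grouping rule for array-based replication can be sketched as follows. This is an illustration under stated assumptions, not SRM code: each VM is assumed to live on a single datastore, a datastore outside any consistency group is treated as its own replication unit, and all names are hypothetical.

```python
from collections import defaultdict


def datastore_groups(vm_to_datastore: dict, datastore_to_consistency_group: dict) -> dict:
    """Group VMs by the replication unit of the datastore they live on.

    With array-based replication, every VM in a replication unit fails over
    together, so each unit is a candidate protection group.
    """
    groups = defaultdict(list)
    for vm, datastore in sorted(vm_to_datastore.items()):
        # A datastore with no consistency group replicates on its own (per LUN).
        unit = datastore_to_consistency_group.get(datastore, datastore)
        groups[unit].append(vm)
    return dict(groups)


# Hypothetical inventory: two datastores share a consistency group.
groups = datastore_groups(
    {"web01": "ds1", "web02": "ds2", "db01": "ds3"},
    {"ds1": "cg-web", "ds2": "cg-web"},
)
```

With vSphere Replication the same question does not arise, because replication is configured per VM rather than per LUN.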

Recovery Plans & Protection Groups

[Diagram: Protection Group A and Protection Group B; Recovery Plan for Groups A & B]

Recovery Plans

Essentially an automated “runbook” for recovery

• Consists of one or more protection groups

• Controls every step of the recovery process

    • Storage (presentation)

    • Network customization (portgroup connection and/or address changes)

    • Power-on sequencing (dependencies settable)

    • Suspension of non-essential workloads

    • Invocation of customer-defined pre/post power-on scripts (optional)

• Steps executed are influenced by the workflow selected, i.e., Planned Migration / DR / Test Failover
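A recovery plan can be pictured as an ordered step list that varies with the chosen workflow. The sketch below is illustrative only: the step wording and ordering are assumptions, not SRM's actual runbook, and the storage and network differences follow the test failover and failover slides later in this deck (writable snapshot plus test network versus promoted replica plus live network).

```python
from enum import Enum


class Workflow(Enum):
    TEST = "test failover"
    PLANNED_MIGRATION = "planned migration"
    DISASTER_RECOVERY = "disaster recovery"


def runbook_steps(workflow: Workflow, scripts: bool = False) -> list:
    """Return an illustrative, ordered list of steps for one recovery plan run."""
    if workflow is Workflow.TEST:
        storage = "present writable snapshot of replica storage"
        network = "connect recovered VMs to the test network"
    else:
        storage = "promote replica storage to read/write"
        network = "connect recovered VMs to the live network"

    steps = ["suspend non-essential workloads", storage, network]
    if scripts:
        steps.append("run customer-defined pre-power-on scripts")
    steps.append("power on VMs in sequence")
    if scripts:
        steps.append("run customer-defined post-power-on scripts")
    return steps
```

Keeping the workflow as an explicit parameter reflects the slide's point that the same plan drives testing, planned migration, and disaster recovery.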

Sequencing

[Diagram: VMs arranged in five groups, Group 1 to Group 5; VMs shown: Master Database, Database, App Server, Exchange, Mail Sync, Apache, Desktop]
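Sequencing by group can be sketched as a simple batching step. The assignment of VMs to groups did not survive in the slide text, so the mapping below is hypothetical, as is the assumption that Group 1 starts first and each group finishes before the next begins.

```python
def power_on_batches(vm_priority: dict) -> list:
    """Return VMs batched by group number, lowest group first.

    vm_priority maps a VM name to its group (1 to 5).
    """
    batches = {}
    for vm, group in vm_priority.items():
        if not 1 <= group <= 5:
            raise ValueError(f"{vm}: group must be between 1 and 5, got {group}")
        batches.setdefault(group, []).append(vm)
    return [sorted(batches[group]) for group in sorted(batches)]


# Hypothetical assignment: databases first, desktops last.
order = power_on_batches({
    "master-db": 1,
    "app-server": 2,
    "apache-1": 3,
    "apache-2": 3,
    "desktop": 5,
})
```

Empty groups are simply skipped, so a plan does not need to use all five.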

Test Failover – Non Disruptive

[Diagram: Protection Group; Source Storage (R/W) on a VMFS LUN; Replica Storage (R/O) on a VMFS LUN; Snapshot of Replica Storage (R/W); Placeholder VMs; Test Network; steps numbered 1, 2, 3]

Failover

[Diagram: Protection Group; Source Storage (R/W) on a VMFS LUN; Replica Storage (R/O) on a VMFS LUN; Replica Storage Promoted (R/W); Placeholder VMs; Live Network; steps numbered 1, 2, 3]

Automated Failback (Reprotect)

• Available for ALL SAN replication SRAs

• Implemented via new “Reprotect” workflow

• Resets protected state for workloads migrated or recovered

• Single button invocation

• “Flips” Protection Group and Recovery Plan states: A->B becomes B->A

• No requirement to manually recreate objects
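The “flip” can be modelled as reversing the direction of an existing object rather than building a new one. This is a conceptual sketch; the `ProtectionGroup` type and `reprotect` function are hypothetical and do not represent the SRM API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProtectionGroup:
    name: str
    protected_site: str
    recovery_site: str


def reprotect(group: ProtectionGroup) -> ProtectionGroup:
    """Return the group with its direction reversed (A->B becomes B->A)."""
    return replace(
        group,
        protected_site=group.recovery_site,
        recovery_site=group.protected_site,
    )


original = ProtectionGroup("tier1-apps", protected_site="A", recovery_site="B")
flipped = reprotect(original)
```

Running `reprotect` a second time returns the original direction, which is what makes a later failback to the first site possible without recreating objects.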


What if we don’t have SAN replication?


Introducing vSphere Replication (VR)

VR Basics

Adding native replication to SRM

• Virtual machines can be replicated regardless of the underlying storage

• Enables replication between heterogeneous datastores

• Replication is managed as a property of a virtual machine

• Efficient replication minimizes impact on VM workloads

[Diagram: source, target]

When To Use Stretched vSphere Clusters?

Campus / nearby sites

• Sites within synchronous distance

• Two buildings on a common campus

• Two datacenters within a city

Planned migration important

• Long-distance vMotion for planned maintenance, disaster avoidance, or load balancing

DR features less critical

• No testing, orchestration, or automation

• VMware HA typically not sufficient for automation: requires scripting / manual process due to VM placement with primary / secondary arrays

• RTOs typically longer

When To Use Site Recovery Manager?

Longer-distance DR sites

• Any sites separated by >100 km

• Any sites separated by <100 km which could still be categorized as “DR” where DR features are important

DR features critical

• Non-disruptive testing

• Automated / reliable / repeatable / auditable DR process

• Customizable recovery workflows

Planned migration with downtime OK

• A couple of hours of downtime acceptable

• Planned migration not done routinely: mostly for disaster avoidance, and infrequently for planned maintenance
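The criteria on the last two slides can be written down as a first-pass filter. This is only a sketch of the slide logic under loud assumptions: the function name is hypothetical, the 100 km figure is used as a stand-in for “synchronous distance”, and a real decision still depends on the pros and cons of each implementation.

```python
def suggest_approach(distance_km: float,
                     dr_features_critical: bool,
                     planned_migration_downtime_ok: bool) -> str:
    """Return a starting point for discussion, not a recommendation."""
    # Long distances or a need for testing / orchestration point to SRM.
    if distance_km > 100 or dr_features_critical:
        return "Site Recovery Manager"
    # Nearby sites that need non-disruptive planned migration.
    if not planned_migration_downtime_ok:
        return "stretched vSphere cluster"
    return "either; weigh the pros and cons"
```

For instance, two buildings 5 km apart with no tolerance for migration downtime land on the stretched cluster branch, while the same sites with a hard requirement for non-disruptive DR testing land on Site Recovery Manager.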

Software Defined Availability: vCloud Services

[Diagram summary]

Software Defined Availability: Clustering, Disaster Recovery, Replication, Data Protection

Any Class of Protection: Flexible Service Level Options

• Example service level: Availability 99.99%, DR RTO 1 hour, RPO 10 min; Storage and Performance values shown include 1 TB, Unified, High I/O, High Security

• Service levels across Tier 1, Tier 2, Tier 3: Bronze (99%, low performance), Silver (99.9%, medium performance), Gold (99.99%, high performance)

Any Application: Traditional / Next-Gen Applications

Apps Anywhere: Application Mobility Over Any Distance (Metro/Geo), Public Cloud

Any Failure Scenario: Component Failures to Large Scale Disasters
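The availability percentages attached to Bronze, Silver, and Gold translate directly into a yearly downtime budget. The short calculation below assumes a 365-day year; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime allowed by an availability target, in hours."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Service levels named on the slide.
for tier, availability in [("Bronze", 99.0), ("Silver", 99.9), ("Gold", 99.99)]:
    print(f"{tier}: {max_downtime_hours(availability):.2f} hours per year")
```

Each additional nine cuts the budget by a factor of ten: roughly 87.6 hours at 99%, 8.76 hours at 99.9%, and under one hour (about 53 minutes) at 99.99%.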


Thank you