Failover Clustering & Hyper-V: Multisite Disaster Recovery Prakash Gopinadham Support Escalation Engineer Microsoft Corporation


Failover Clustering & Hyper-V: Multisite Disaster Recovery

Prakash Gopinadham, Support Escalation Engineer

Microsoft Corporation


Multi-Site Clustering Content

Design guide: http://technet.microsoft.com/en-us/library/dd197430.aspx
Deployment guide/checklist: http://technet.microsoft.com/en-us/library/dd197546.aspx
Customer case studies using multi-site clustering: http://blogs.msdn.com/b/clustering/archive/2009/11/04/9917628.aspx


Multi-Site Clustering

Introduction

Networking

Storage

Quorum


Defining High-Availability

High availability allows applications to maintain service availability by moving them between nodes in a cluster.

But what if there is a catastrophic event and you lose the entire datacenter?

[Diagram: cluster nodes at Site A]


Defining Disaster Recovery

Disaster Recovery (DR) allows applications to maintain service availability by moving them to a cluster node in a different physical location

Node is located at a physically separate site

[Diagram: cluster nodes and SAN storage at Site A, with an additional node at Site B]


Benefits of a Multi-Site Cluster

• Protects against loss of an entire location
– Power outage, fires, hurricanes, floods, earthquakes, terrorism
• Automates failover
– Reduced downtime
– Lower complexity disaster recovery plan

What is the primary reason why DR solutions fail?

Dependence on People


Stretching the Network

• Longer distance traditionally means greater network latency
• Missed inter-node health checks can cause false failover
• Cluster heartbeating is fully configurable
– SameSubnetDelay (default = 1 second)
  • How often heartbeats are sent
– SameSubnetThreshold (default = 5 heartbeats)
  • Missed heartbeats before an interface is considered down
– CrossSubnetDelay (default = 1 second)
  • How often heartbeats are sent to nodes on dissimilar subnets
– CrossSubnetThreshold (default = 5 heartbeats)
  • Missed heartbeats before an interface is considered down to nodes on dissimilar subnets
• To view the current settings:
– Command line: Cluster.exe /prop
– PowerShell (R2): Get-Cluster | fl *
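As a rough illustration of how the delay and threshold settings interact, the Python sketch below estimates how long a node can be unreachable before its interface is considered down. It assumes the detection window is simply delay multiplied by threshold. The function name and the relaxed 2-second / 10-heartbeat values are hypothetical and are not recommendations from this session.

```python
def detection_time_seconds(delay_seconds: float, threshold: int) -> float:
    """Approximate time before an interface is considered down.

    Assumption: one heartbeat is sent every `delay_seconds`, and the
    interface is marked down after `threshold` consecutive misses.
    """
    if delay_seconds <= 0 or threshold <= 0:
        raise ValueError("delay and threshold must be positive")
    return delay_seconds * threshold


# Defaults quoted on the slide: 1 second delay, 5 missed heartbeats.
default_window = detection_time_seconds(1, 5)

# Hypothetical relaxed cross-subnet values for a high-latency WAN link.
relaxed_window = detection_time_seconds(2, 10)

print(f"default: {default_window}s, relaxed: {relaxed_window}s")
```

A longer window tolerates more WAN jitter before a false failover, at the cost of slower detection of a real failure.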


Security over the WAN

• Encrypt inter-node communication
• Trade-off: security versus performance
• SecurityLevel cluster property:
– 0 = clear text
– 1 = signed (default)
– 2 = encrypted

[Diagram: cluster nodes at Site A and Site B with addresses 10.10.10.1, 20.20.20.1, 30.30.30.1, and 40.40.40.1]


Network Considerations

Network deployment options:
1. Stretch VLANs across sites
2. Cluster nodes can reside in different subnets

[Diagram: Site A and Site B joined by a public network and a redundant network; node addresses 10.10.10.1, 20.20.20.1, 30.30.30.1, and 40.40.40.1]


DNS Considerations

• Nodes are in dissimilar subnets
• VM obtains a new IP address
• Clients need that new IP address from DNS to reconnect

[Diagram: the VM record is created as 10.10.10.111 (Site A) and updated to 20.20.20.222 (Site B); the update passes through DNS replication between DNS Server 1 and DNS Server 2 before the client obtains the new record]


Faster Failover for Multi-Subnet Clusters

• RegisterAllProvidersIP (default = 0 for FALSE)
– Determines whether all IP addresses for a Network Name are registered in DNS
– TRUE (1): IP addresses are registered whether they are online or offline
– Ensure the application is set to try all IP addresses, so clients can come online quicker
• HostRecordTTL (default = 1200 seconds)
– Controls how long the DNS record for a cluster network name lives on the client
– Shorter TTL: DNS records on clients are updated sooner
– Exchange Server 2007 recommends a value of five minutes (300 seconds)
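The advice to have the application try all registered IP addresses can be sketched as follows. This is a minimal Python illustration, not how any particular client is implemented: `connect_first_available` and the `try_connect` callback are hypothetical names, and the callback stands in for a real connection attempt.

```python
from typing import Callable, Iterable, Optional


def connect_first_available(
    addresses: Iterable[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first address that accepts a connection, or None."""
    for address in addresses:
        if try_connect(address):
            return address
    return None


# With RegisterAllProvidersIP = 1, DNS returns both addresses even though
# only one is online. Here the Site B address is the one that answers.
registered = ["10.10.10.111", "20.20.20.222"]
online = {"20.20.20.222"}

reachable = connect_first_available(registered, lambda ip: ip in online)
print(reachable)
```

Because the client already holds both addresses, it does not need to wait for a DNS update or for the HostRecordTTL to expire before reconnecting.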


Solution #1: Local Failover First

• Configure local failover first for high availability
– No change in IP addresses
– No DNS replication issues
– No data going over the WAN
• Cross-site failover for disaster recovery

[Diagram: DNS Server 1; the VM keeps 10.10.10.111 when failing over locally at Site A and uses 20.20.20.222 after cross-site failover to Site B]


Solution #2: Stretch VLANs

• Deploying a VLAN minimizes client reconnection times
– IP of the VM never changes

[Diagram: a VLAN spanning Site A and Site B, DNS Server 1 and DNS Server 2, and the record FS = 10.10.10.111]


Solution #3: Abstraction in Networking Device

• Networking device uses an independent 3rd IP address
• The 3rd IP address is registered in DNS and used by the client

[Diagram: addresses 10.10.10.111 (Site A) and 20.20.20.222 (Site B) behind a networking device at 30.30.30.30; DNS Server 1 and DNS Server 2 hold the record VM = 30.30.30.30]


Storage in Multi-Site Clusters

Different than local clusters:
– Multiple storage arrays, independent per site
– Nodes commonly access their own site's storage
– No ‘true’ shared disk visible to all nodes

[Diagram: Site A and Site B with SAN storage]


Storage Considerations

• DR requires a data replication mechanism between sites
• Changes are made on Site A and replicated to Site B

[Diagram: SAN storage at Site A replicating to a replica at Site B]


Replication Partners

• Hardware storage-based replication
– Block-level replication
• Software host-based replication
– File-level replication
• Appliance replication
– File-level replication


Synchronous Replication

Host receives “write complete” response from the storage after the data is successfully written on both storage devices

[Diagram: (1) write request to primary storage at Site A, (2) replication to secondary storage at Site B, (3) acknowledgement, (4) write complete]


Asynchronous Replication

• Host receives “write complete” response from the storage after the data is successfully written to just the primary storage device; replication to the secondary storage device happens afterward

[Diagram: (1) write request to primary storage at Site A, (2) write complete, (3) replication to secondary storage at Site B]


Synchronous versus Asynchronous

Synchronous                                    | Asynchronous
No data loss                                   | Potential data loss on hard failures
Requires high bandwidth/low latency connection | Enough bandwidth to keep up with data replication
Stretches over shorter distances               | Stretches over longer distances
Write latencies impact application performance | No significant impact on application performance
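The trade-off above can be made concrete with a small Python model. It is a sketch under simplifying assumptions: the synchronous path is treated as strictly sequential (local write, then a network round trip, then the remote write), and the data at risk in asynchronous mode is approximated as write rate multiplied by replication lag. All function names and numbers are hypothetical.

```python
def sync_write_latency_ms(local_write_ms: float, round_trip_ms: float,
                          remote_write_ms: float) -> float:
    """Host waits for the local write, the replication round trip,
    and the remote write before it sees 'write complete'."""
    return local_write_ms + round_trip_ms + remote_write_ms


def async_write_latency_ms(local_write_ms: float) -> float:
    """Host sees 'write complete' once the primary storage has the data."""
    return local_write_ms


def async_data_at_risk_mb(write_rate_mb_per_s: float,
                          replication_lag_s: float) -> float:
    """Data not yet replicated, and so lost if the primary site fails hard."""
    return write_rate_mb_per_s * replication_lag_s


# Hypothetical values: 1 ms array writes, 10 ms WAN round trip,
# 5 MB/s of changes, and a 30 second replication lag.
sync_ms = sync_write_latency_ms(1.0, 10.0, 1.0)
async_ms = async_write_latency_ms(1.0)
at_risk_mb = async_data_at_risk_mb(5.0, 30.0)

print(sync_ms, async_ms, at_risk_mb)
```

In this model the WAN round trip dominates synchronous write latency, which is why synchronous replication stretches over shorter distances.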


Cluster Validation with Replicated Storage

Multi-Site clusters are not required to pass the Storage tests to be supported

Validation Guide and Policy: http://go.microsoft.com/fwlink/?LinkID=119949


Challenges of Block Storage Replication

• Storage block-level replication is typically unidirectional (per LUN)
– Changed blocks flow from source to remote
– Possible to have different LUNs replicating in different directions
– Storage cannot enforce block-level collision resolution
– Application must determine resolution, or be coordinated in some way
• Applications today implement a shared-nothing model
– Surfacing storage as R/W at multiple sites is only useful if applications can handle a distributed access device
– Few applications implement the necessary support
– Obvious exception is Cluster Shared Volumes for Hyper-V


Quorum Overview

4 quorum types:
• Disk only (not recommended)
• Node and Disk majority
• Node majority
• Node and File Share majority

Majority is greater than 50%

Possible voters:
• Nodes (1 vote each)
• 1 witness (disk or file share)
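The "greater than 50%" rule can be written out as a short Python sketch. The helper names are hypothetical. The only rule encoded is the one stated above: each node has one vote, an optional witness adds one, and quorum needs more than half of all votes.

```python
def total_votes(nodes: int, witness: bool) -> int:
    """One vote per node, plus one if a disk or file share witness exists."""
    return nodes + (1 if witness else 0)


def majority(votes: int) -> int:
    """Smallest number of votes that is strictly greater than 50%."""
    return votes // 2 + 1


def has_quorum(votes_present: int, votes_total: int) -> bool:
    return votes_present >= majority(votes_total)


# 5-node cluster with node majority: 3 votes are needed.
print(majority(total_votes(5, witness=False)))

# 4 nodes split 2/2 with no witness: neither half has quorum.
print(has_quorum(2, total_votes(4, witness=False)))

# 4 nodes plus a witness: 2 nodes and the witness make 3 of 5 votes.
print(has_quorum(3, total_votes(4, witness=True)))
```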


Replicated Disk Witness

• A witness is a tie breaker when nodes lose network connectivity
– The witness disk must be a single decision maker, or problems can occur
• Do not use a Disk Witness in multi-site clusters unless directed by vendor

[Diagram: three voting nodes and a disk witness on replicated storage whose vote is in question]


Node Majority

Scenario: cross-site network connectivity is broken.

• 5-node cluster: majority = 3
• Majority of the nodes is in the primary site (Site A)
• Each node asks: can I communicate with a majority of the nodes in the cluster?
– Site A: yes, so stay up
– Site B: no, so drop out of cluster membership


Node Majority

Scenario: disaster at Site A, the primary site that holds the majority of the nodes.

• 5-node cluster: majority = 3
• Site A: we are down!
• Site B nodes ask: can I communicate with a majority of the nodes in the cluster?
– No, so drop out of cluster membership
• Quorum must be forced manually


Forcing Quorum

• Forcing quorum is a way to manually override and start a node even if the cluster does not have quorum
– Important: understand why quorum was lost
– Cluster starts in a special “forced” state
– Once majority is achieved, the cluster drops out of the “forced” state
• Command line:
– net start clussvc /fixquorum (or /fq)
• PowerShell (R2):
– Start-ClusterNode -FixQuorum (or -fq)
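The "forced" state described above can be pictured with a toy Python model. This is only a sketch of the behavior as the slide describes it, not of the cluster service itself: the class name, its attributes, and the assumption that the forced flag clears as soon as a majority of votes is online are all illustrative.

```python
class ForcedQuorumModel:
    """Toy model of a cluster that can be started in a forced state."""

    def __init__(self, total_votes: int) -> None:
        self.total_votes = total_votes
        self.online_votes = 0
        self.forced = False

    @property
    def majority(self) -> int:
        return self.total_votes // 2 + 1

    def start_node(self, fix_quorum: bool = False) -> None:
        """Bring one voting node online, optionally forcing quorum."""
        self.online_votes += 1
        if fix_quorum:
            self.forced = True
        # Once a real majority is online, the forced state is dropped.
        if self.online_votes >= self.majority:
            self.forced = False

    @property
    def running(self) -> bool:
        return self.forced or self.online_votes >= self.majority


# 5-vote cluster after the 3-node primary site is lost.
cluster = ForcedQuorumModel(total_votes=5)
cluster.start_node(fix_quorum=True)  # first surviving node, quorum forced
cluster.start_node()                 # second surviving node joins
state_while_forced = (cluster.running, cluster.forced)

cluster.start_node()                 # a primary-site node returns: 3 of 5
state_after_majority = (cluster.running, cluster.forced)

print(state_while_forced, state_after_majority)
```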


Multi-Site with File Share Witness

Complete resiliency and automatic recovery from the loss of any 1 site

[Diagram: Site A and Site B connected over the WAN to a file share witness (\\Foo\Share) at Site C (branch office)]


Multi-Site with File Share Witness

Complete resiliency and automatic recovery from the loss of connection between sites

• Each node asks: can I communicate with a majority of the nodes (+FSW) in the cluster?
– Site that locks the file share witness: yes, so stay up
– Other site: no (lock failed), so drop out of cluster membership

[Diagram: Site A and Site B, each connected over the WAN to the file share witness (\\Foo\Share) at Site C (branch office)]
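The split between the two sites can be sketched in Python as a vote count per partition. This is an illustrative model only: the `Partition` class and `surviving_partitions` function are hypothetical, and which site wins the witness lock is chosen arbitrarily here (Site A).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    name: str
    nodes: int
    holds_witness: bool = False  # True if this side locked the file share

    @property
    def votes(self) -> int:
        return self.nodes + (1 if self.holds_witness else 0)


def surviving_partitions(partitions: List[Partition],
                         total_votes: int) -> List[str]:
    """Names of partitions holding a strict majority of all votes."""
    needed = total_votes // 2 + 1
    return [p.name for p in partitions if p.votes >= needed]


# 2 nodes per site plus the witness = 5 votes. The inter-site link fails,
# and Site A is assumed to lock the witness first.
site_a = Partition("Site A", nodes=2, holds_witness=True)
site_b = Partition("Site B", nodes=2)

survivors = surviving_partitions([site_a, site_b], total_votes=5)
print(survivors)
```

Without the witness the same 2/2 split leaves neither side with a majority, which is why an even node count is paired with a file share witness.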


File Share Witness (FSW) Considerations

• Simple Windows File Server
• Single file server can serve as a witness for multiple clusters
– Each cluster requires its own share
– Can be made highly available on a separate cluster
• Recommended to be at 3rd separate site for DR
• FSW cannot be on a node in the same cluster
• FSW should not be in a VM running on the same cluster


Quorum Model Recap

Node and File Share Majority
• Even number of nodes
• Highest availability solution has FSW in 3rd site

Node Majority
• Odd number of nodes
• More nodes in primary site

Node and Disk Majority
• Use as directed by vendor

No Majority: Disk Only
• Not recommended
• Use as directed by vendor
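The even/odd guidance in the recap can be expressed as a small Python helper. It is a sketch that encodes only that one rule. The function name is hypothetical, and the two vendor-directed models are deliberately left out because the recap gives no rule for choosing them.

```python
def recommend_quorum_model(node_count: int) -> str:
    """Pick a quorum model from the node count, per the recap's even/odd rule."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 0:
        # Even node count: add a file share witness as the tie breaker.
        return "Node and File Share Majority"
    # Odd node count: the nodes alone can always form a majority.
    return "Node Majority"


for count in (4, 5):
    print(count, recommend_quorum_model(count))
```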


Session Summary

• Multi-site failover clusters have many benefits
• You can achieve high availability and disaster recovery in a single solution using Windows Server Failover Clustering
• Multi-site clusters have additional considerations:
– Determine network topology across sites
– Choose a storage replication solution
– Plan quorum model & nodes


Failover Clustering Resources

• Design for a Clustered Service or Application in a Multi-Site Failover Cluster http://technet.microsoft.com/en-us/library/dd197430(WS.10).aspx

• Checklist: Setting Up a Clustered Service or Application in a Multi-Site Failover Cluster http://technet.microsoft.com/en-us/library/dd197546(WS.10).aspx

• Cluster Information Portal: http://www.microsoft.com/windowsserver2008/en/us/clustering-home.aspx

• Clustering Technical Resources: http://www.microsoft.com/windowsserver2008/en/us/clustering-resources.aspx

• Clustering Forum (2008): http://forums.technet.microsoft.com/en-US/winserverClustering/threads/

• Clustering Forum (2008 R2): http://social.technet.microsoft.com/Forums/en-US/windowsserver2008r2highavailability/threads/

• R2 Cluster Features: http://technet.microsoft.com/en-us/library/dd443539.aspx


Resources

Software Application Developers

http://msdn.microsoft.com/

Infrastructure Professionals

http://technet.microsoft.com/

msdnindia technetindia @msdnindia @technetindia


© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and

Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.