
Ten vSphere HA and DRS Misconfiguration Issues: High Availability Issues

AUTHOR: JOHN HALES | 3 JANUARY 2012

TAGS: DRS, HA, VCENTER, VSPHERE 5.0

VMware’s vSphere and vCenter products form a popular and powerful virtualization suite. This post focuses on ten of the biggest mistakes people make when configuring the High Availability (HA) and Distributed Resource Scheduler (DRS) features. We’ll begin by looking at five common HA issues, then four common DRS issues, and conclude with an issue that affects both HA and DRS.

HA Issues

HA is included in almost every version of vSphere, including one of the small business bundles (Essentials Plus), because the impact of an ESXi host failure is much bigger than the loss of a single server in the traditional physical world: many virtual machines (VMs) are affected at once. Thus, it is very important to get HA designed and configured correctly.

Purchasing Differently Configured Servers

One of the common mistakes people make is buying differently sized servers (more CPU and/or memory in some servers than others) and placing them in the same cluster. This is often done with the idea that some VMs require a lot more resources than others, and the big, powerful servers are more expensive than several smaller servers. The problem with this thinking is that HA is pessimistic and assumes that the largest servers will fail.

Solution: Either buy servers that are configured the same (or at least similarly) or create a couple of different clusters, with each cluster having servers configured the same. Some people also implement affinity rules to keep the big VMs on designated servers, but this impacts DRS – we’ll cover that issue later.
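To see why this matters, here is a simplified, back-of-the-envelope comparison (the host sizes are hypothetical and only memory is considered): a mixed cluster must effectively hold back the capacity of its largest host, while identically sized hosts with the same total capacity hold back less.

```python
# Illustrative only: with host-failure-based admission control, HA plans for
# the failure of the biggest host, so mixed host sizes force a larger
# reservation than identically sized hosts with the same total capacity.
mixed_hosts_gb = [256, 256, 128, 128]    # hypothetical mixed cluster (GB RAM per host)
uniform_hosts_gb = [192, 192, 192, 192]  # same 768 GB total, identical hosts

def held_back_fraction(hosts_gb):
    """Fraction of total memory effectively set aside for the largest host failing."""
    return max(hosts_gb) / sum(hosts_gb)

print(f"Mixed cluster:   {held_back_fraction(mixed_hosts_gb):.0%} held back")    # 33%
print(f"Uniform cluster: {held_back_fraction(uniform_hosts_gb):.0%} held back")  # 25%
```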

Insufficient Hosts to Run All VMs Accounting for HA Overhead

When budgets are tight, many administrators size their environments with just enough resources to run all the VMs that are needed, but forget to account for the overhead HA imposes to guarantee that sufficient resources exist to restart the VMs from a failed host (or multiple hosts, if you are pessimistic). VMware’s best practice is to always leave Admission Control enabled so that HA automatically sets aside resources to restart VMs after a host failure.


Solution: Plan for the HA overhead and purchase sufficient hardware to cover the resources required by the VMs in the environment plus the overhead for HA.
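As a rough sanity check, a sketch like the following (all numbers are hypothetical, only memory is modeled, and identical hosts are assumed) estimates how many hosts are needed once the HA overhead is included:

```python
# Rough N+1 sizing check (hypothetical numbers): does the planned cluster
# still hold all VM memory after losing a host? Ignores CPU, virtualization
# overhead, and admission-control slot fragmentation.
vm_memory_gb = [8, 8, 16, 16, 32, 4, 4, 12, 24, 8]  # planned VMs, 132 GB total
host_memory_gb = 128                                 # per host, identical hosts
failures_to_tolerate = 1

def hosts_required(vm_total_gb, per_host_gb, failures):
    """Smallest host count whose surviving capacity still fits the VMs."""
    n = failures + 1
    while (n - failures) * per_host_gb < vm_total_gb:
        n += 1
    return n

needed = sum(vm_memory_gb)
n = hosts_required(needed, host_memory_gb, failures_to_tolerate)
print(f"{needed} GB of VM memory needs at least {n} x {host_memory_gb} GB hosts "
      f"to tolerate {failures_to_tolerate} host failure(s)")
```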

Using the Host Failures Cluster Tolerates Policy

Recall that there are three admission control policies, namely:

Host failures the cluster tolerates: The original (and originally the only) option for HA, this policy assumes the loss of a specified number of hosts (one to four in versions 3 and 4, up to 31 in vSphere 5).

Percentage of cluster resources reserved as failover spare capacity: Introduced in vSphere 4, this option sets aside a specified percentage of both CPU and memory resources from the total in the cluster for failover use; vSphere 5 improved this option by allowing different percentages to be specified for CPU and memory.

Specify failover hosts: This policy designates a standby host that runs all the time but is never used for running VMs unless a host in the cluster fails. It was introduced in vSphere 4 and improved in version 5 by allowing multiple failover hosts to be specified.

As described previously, HA is pessimistic and always assumes the largest host will fail, reserving more resources than usually needed if the hosts are sized differently (though, per issue one, we don’t recommend that). This policy also uses a concept called slots to reserve the right amount of spare capacity, but it takes a “one size fits all” approach: the largest CPU reservation and the largest memory reservation among the VMs define the slot size used for every VM.
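A worked example with hypothetical reservations shows how a single large reservation inflates the slot size for every VM (real HA also applies default minimums and per-host rounding, which this sketch ignores):

```python
# Illustrative slot-size math for the "Host failures cluster tolerates" policy:
# the slot is the largest CPU reservation paired with the largest memory
# reservation across powered-on VMs, and it is applied to every VM.
vm_reservations = [            # (cpu_mhz, mem_mb) reservations, hypothetical
    (500, 1024), (500, 1024), (500, 2048),
    (4000, 16384),             # one big database VM
]

slot_cpu_mhz = max(cpu for cpu, _ in vm_reservations)   # 4000 MHz
slot_mem_mb = max(mem for _, mem in vm_reservations)    # 16384 MB

host_cpu_mhz, host_mem_mb = 20000, 98304                # one hypothetical host
slots_per_host = min(host_cpu_mhz // slot_cpu_mhz, host_mem_mb // slot_mem_mb)
print(f"Slot = {slot_cpu_mhz} MHz / {slot_mem_mb} MB -> {slots_per_host} slots per host")
```

Without the single 4000 MHz / 16 GB reservation, the slot would be 500 MHz / 2 GB and the same host would offer 40 slots instead of 5.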

Solution: Use the VMware-recommended policy, Percentage of cluster resources reserved as failover spare capacity, instead; it reserves a percentage of the entire cluster’s resources and uses the actual reservation of each VM rather than the single largest reservation.
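If you script against vCenter, a sketch along these lines can make the switch; it uses the open-source pyVmomi SDK, and the vCenter address, credentials, cluster name, and the 25/25 percentages are all placeholder assumptions:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Sketch: switch an HA cluster to the percentage-based admission control
# policy. Hostname, credentials, cluster name, and 25%/25% are placeholders;
# SSL handling may differ depending on your pyVmomi version.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Production")

    policy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
        cpuFailoverResourcesPercent=25,
        memoryFailoverResourcesPercent=25)
    das = vim.cluster.DasConfigInfo(admissionControlEnabled=True,
                                    admissionControlPolicy=policy)
    spec = vim.cluster.ConfigSpecEx(dasConfig=das)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
finally:
    Disconnect(si)
```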

Forgetting to Update the Percentage Admission Control Policy as Cluster Grows

If the Percentage of cluster resources reserved as failover spare capacity policy is used (as suggested), it is important to reserve the correct amount of CPU and memory based on the needs of the VMs and the size of the cluster. For example, in a two-node cluster, the loss of one node removes half of the cluster resources (assuming they are sized the same), so the percentage may be set to 50. However, if additional nodes are added to the cluster later, that value is probably too high and should be reduced to account for the additional node(s) and the number of simultaneous failures expected (for example, with four nodes, tolerating the loss of one node suggests setting the percentage to 25, while if two failures are expected, 50 percent should be used).

Solution: Go back and recalculate the appropriate value in your cluster whenever hosts are added to or removed from the cluster.
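The arithmetic is simple enough to script as a reminder; this small helper (identically sized hosts assumed, values rounded up so you never under-reserve) recomputes the percentage for a given cluster size:

```python
import math

def failover_percentage(num_hosts, host_failures_to_tolerate=1):
    """Percentage of cluster resources to reserve so the cluster can absorb
    the given number of host failures (identically sized hosts assumed)."""
    return math.ceil(100 * host_failures_to_tolerate / num_hosts)

for n in (2, 4, 8):
    print(f"{n} hosts -> reserve {failover_percentage(n)}%")
# 2 hosts -> reserve 50%, 4 hosts -> reserve 25%, 8 hosts -> reserve 13%
```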

Configuring VM Restart Priorities Inefficiently

One of the settings that can be configured in an HA cluster is the default restart priority of VMs after a host failure. This defaults to Medium but can be set to Low, Medium, High, or Disabled (if most VMs should not be restarted after a host failure).

Solution: Consider setting the cluster default restart priority to Low, leaving the two higher levels available for more important VMs. For example, infrastructure VMs such as domain controllers or DNS servers might be the highest priority (set those VMs to High), followed by critical services such as database or e-mail servers (set those to Medium), with the rest of the VMs remaining at the default (Low). Any VMs that don’t need to be restarted can be set to Disabled to save resources after a host failure.
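If you prefer to apply these settings programmatically, here is a pyVmomi sketch along the same lines as the earlier one; the vCenter address, credentials, cluster name, VM names, and the priority mapping are all placeholder assumptions:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Sketch: set the cluster default restart priority to Low and add per-VM
# overrides for a few infrastructure and critical-service VMs. Hostname,
# credentials, cluster name, VM names, and the mapping are all placeholders.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Production")

    overrides = {"dc01": "high", "dns01": "high", "sql01": "medium"}
    vm_specs = [
        vim.cluster.DasVmConfigSpec(
            operation="add",  # use "edit" if the VM already has an override
            info=vim.cluster.DasVmConfigInfo(
                key=vm,
                dasSettings=vim.cluster.DasVmSettings(
                    restartPriority=overrides[vm.name])))
        for host in cluster.host for vm in host.vm if vm.name in overrides
    ]
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(
            defaultVmSettings=vim.cluster.DasVmSettings(restartPriority="low")),
        dasVmConfigSpec=vm_specs)
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
finally:
    Disconnect(si)
```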

Recommended Courses

VMware vSphere: Fast Track [V5.0]

VMware vSphere: Install, Configure, Manage [V5.0]

VMware vSphere: What’s New [V5.0]

Reprinted with permission from Ten vSphere HA and DRS Misconfiguration Issues

Ten vSphere HA and DRS Misconfiguration Issues Series