Our environment is built entirely with VMware Auto Deploy 5.1. Every cluster has its own rule set, and we’re very happy with how easily we can update existing hosts and deploy new ones. Recently, however, we ran into a major issue with Microsoft Cluster Services (MSCS) inside VMs in combination with VMware Auto Deploy. In a scenario with an “old fashioned” MSCS cluster across boxes, with physical RDMs shared between the two cluster nodes (VMs), these RDMs seem to stop an Auto Deployed ESXi host from reconnecting to vCenter Server after a reboot.
A normal reboot of an Auto Deployed ESXi host goes like this:
– Power ON
– Ask DHCP for IP address and PXE boot server address
– Connect to TFTP server
– Talk to Auto Deploy Server
– Download ESXi image based on rule set
– Boot that image in a default configuration
– Report to vCenter Server
– Retrieve and apply host profile that holds the proper configuration settings
– Exit maintenance mode
When MSCS RDMs are present, the following will happen (taken from VMware KB 1016106): “During the start of an ESXi host, the storage mid-layer attempts to discover all devices presented to an ESXi host during the device claiming phase. However, MSCS LUNs that have a permanent SCSI reservation cause the start process to lengthen as the ESXi host cannot interrogate the LUN due to the persistent SCSI reservation placed on a device by an active MSCS Node hosted on another ESXi host.”
For a normal ESXi host, the “only” issue is the longer boot time. On an Auto Deployed host, however, the longer scan causes timeouts while the host connects to vCenter Server, so that either the reconnect or the application of the host profile eventually fails, leaving the host in a rather useless state. For us this means that for every host reboot we first need to disconnect the RDMs from the host, perform the reboot, make sure the host is configured correctly, and then reconnect the RDMs. Not very convenient.
VMware KB 1016106, mentioned above, explains how to use the “perennially-reserved=true” setting, which is set per LUN (RDM):
esxcli storage core device setconfig -d naa.id --perennially-reserved=true
For an Auto Deployed host, however, this won’t work: the setting has to be applied through a host profile, and because of the slow scanning that profile is never applied. VMware Support acknowledged the issue and said there will be a fix in 5.1 Update 3 and 5.5 Update 3. Until then we’re going back to locally installed ESXi hosts for the VMs that are part of an MSCS cluster.
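On a locally installed host, where the command can be run directly from the ESXi shell, applying the setting to several RDM LUNs could look roughly like the sketch below. The function name mark_perennially_reserved and the device IDs are made up for illustration; the two esxcli commands are the ones KB 1016106 describes for setting and checking the flag. Treat it as a starting point under those assumptions, not as a tested procedure.

```shell
#!/bin/sh
# Sketch: mark each MSCS RDM LUN as perennially reserved on one ESXi host.
# mark_perennially_reserved is a hypothetical helper name; the device IDs
# are passed as arguments and must be replaced with the real naa.* IDs.

mark_perennially_reserved() {
    for dev in "$@"; do
        # Same command as above, run once per LUN; stop on the first failure.
        esxcli storage core device setconfig -d "$dev" --perennially-reserved=true || return 1
        # Print the resulting flag line so it can be checked by eye.
        esxcli storage core device list -d "$dev" | grep -i "Perennially Reserved"
    done
}
```

It would be called with the IDs of the shared RDM LUNs, for example mark_perennially_reserved naa.id1 naa.id2 (placeholder IDs), on every host that can see those LUNs.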
We have had the same problem in our environment with Microsoft clusters running SQL Server. We have to keep these segmented off from our other Auto Deploy hosts and have them installed locally. Can’t wait for the fix, so that we can go back to fully stateless, with the exception of a few hosts that run the Auto Deploy infrastructure.