VMware HA Cluster Failure – Split Brain Interrogation
If one or more VMware ESX cluster nodes have suffered a hard crash or failure, you must reintroduce them into the cluster by following the steps below, one host at a time. This guide is helpful when multiple ESX hosts in an HA cluster have crashed due to a power outage, massive hardware failure, etc., the HA service on some or all of the ESX nodes in the cluster is non-functional, and virtual machines have been displaced by the (God forbid it ever happens to you) "split-brain" scenario.
It may be useful to first query the cluster for your HA primaries using PowerShell. I use VMware PowerCLI and run this simple script, which I call Get-HA-Primaries.ps1:
Connect-VIServer YourVirtualCenterServerNameHere
((Get-View (Get-Cluster YourESXClusterNameHere).id).RetrieveDasAdvancedRuntimeInfo()).DasHostInfo.PrimaryHosts
This will output what the cluster currently knows about HA Primaries.
1) At the root of the cluster, set VMware DRS to "Manual" so that no automatic VMotion migrations are triggered until all nodes are correctly configured and back in the cluster. In VirtualCenter, right-click the root of the cluster, choose "Edit Settings", click "VMware DRS", set the automation level to "Manual", and click OK.
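If you would rather do this from PowerCLI than from the GUI, a minimal sketch (assuming you are already connected via Connect-VIServer and substituting your own cluster name) looks like this:

# Set the cluster's DRS automation level to Manual so no automatic migrations fire
Get-Cluster YourESXClusterNameHere | Set-Cluster -DrsAutomationLevel Manual -Confirm:$false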
2) Power on the ESX host if it is off and watch it from the console to make sure it boots properly.
3) Next, log into the SIM page of the host (if applicable) as root to validate that the hardware is not displaying any obvious problems.
4) In VirtualCenter, verify that the ESX host is back in the cluster. If the host shows as disconnected or has any HA errors, perform steps 5 through 8 in their exact order.
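A quick way to check the host's connection state from PowerCLI (a sketch; YourESXHostNameHere is a placeholder) is:

# Show the host's connection state as VirtualCenter sees it
Get-VMHost YourESXHostNameHere | Select-Object Name, ConnectionState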
5) Restart the Virtual Center Server service – “VMware VirtualCenter Server”
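If you are at a PowerShell prompt on the VirtualCenter server rather than using services.msc, a minimal sketch to bounce the service by its display name is:

# Restart the VirtualCenter service (run from an elevated prompt on the VC server)
Restart-Service -DisplayName "VMware VirtualCenter Server"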
6) Run the following commands from the problematic ESX host's console (KVM, local console, or PuTTY) as root or via sudo.
service vmware-vpxa restart    # restart the VirtualCenter agent (vpxa)
service mgmt-vmware restart    # restart the host management service (hostd)
service xinetd restart         # restart xinetd
7) Verify that the VMware core services are running on the host server by typing:
ps -ef | grep hostd
It should show results similar to the following, which indicates that hostd is running:
root 1887 1 0 Oct31 ? 00:00:01 cmahostd -p 15 -s OK
root 2713 1 0 Oct31 ? 00:00:00 /bin/sh /usr/bin/vmware-watchdog -s hostd -u 60 -q 5 -c /usr/sbin/hostd-support /usr/sbin/vmware-hostd -u
root 2724 2713 0 Oct31 ? 00:11:41 /usr/lib/vmware/hostd/vmware-hostd /etc/vmware/hostd/config.xml -u
root 21263 12546 0 11:34 pts/0 00:00:00 grep hostd
End of host commands
8) Reconfigure HA by right-clicking the host in VirtualCenter and selecting "Reconfigure for HA". If any HA or connection errors persist, try disconnecting and then reconnecting the host; both are right-click operations on the host within VirtualCenter. You may be asked to re-authenticate the host to VirtualCenter; if the wizard prompts you, simply provide the host's root password.
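The same two operations can also be scripted from PowerCLI. The following is a rough sketch; the host name is a placeholder, and I am assuming ReconfigureHostForDAS() is the API method behind the "Reconfigure for HA" menu item:

# Reconfigure HA (DAS) on the host - the scripted equivalent of "Reconfigure for HA"
(Get-VMHost YourESXHostNameHere | Get-View).ReconfigureHostForDAS()

# If errors persist, disconnect and then reconnect the host
Get-VMHost YourESXHostNameHere | Set-VMHost -State Disconnected -Confirm:$false
Get-VMHost YourESXHostNameHere | Set-VMHost -State Connected -Confirm:$false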
If the host cannot be reconnected after following these steps, either call the VMware lead or VMware support at 1-877-4VM-Ware.
If the host becomes connected and operational, you may have VM guest registration issues.
There are several scenarios that may require you to remove the virtual machines from inventory and re-add them. If multiple hosts crash simultaneously, you will most likely have HA issues that create the state known as "split-brain", in which virtual machines end up scattered around the cluster because of the SAN file-locking mechanism used by the ESX host servers. The result is more than one host "thinking" it has the same virtual machine registered to it.

The SAN locking can also leave locks on a guest's vswap file held by several hosts at the same time. You must manually release the lock on each host that has the outdated vswap file location info, which is time consuming, and the virtual machine(s) will not boot until the lock is freed. The following commands let you see where the lock is held (always on either vmnic0 or vmnic1) by enumerating the MAC address, so you can determine which host has the invalid data.
vmkfstools -D /vmfs/volumes/sanvolumename/vmname/swapfile    # dump the file's lock metadata
tail -f /var/log/vmkernel    # the lock owner (including its MAC address) is written to the vmkernel log
Once you identify the host, migrate all of the guests off of it and put it into maintenance mode, then reboot it to flush its memory and locks and force the release of the bad, outdated VM inventory data.
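If you prefer to stage the reboot from PowerCLI, a sketch follows (host name is a placeholder; with DRS set to Manual earlier, you will have to migrate the guests off yourself first):

# Enter maintenance mode, then reboot the host to force the stale locks to be released
Get-VMHost YourESXHostNameHere | Set-VMHost -State Maintenance -Confirm:$false
Get-VMHost YourESXHostNameHere | Restart-VMHost -Confirm:$false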
If the MAC address indicates that the guest is actually locked by the very host the guest is attempting to boot on, simply delete the vswap file and let the guest re-create it when it boots. To determine whether the booting host is the owner, look at the command output: it will contain all zeroes in the hex field where the MAC address would otherwise appear. The vswap file is in the virtual machine's folder in /vmfs/volumes/sanvolumename/vmname.
To view VM registration on a host, look at /etc/vmware/hostd/vmInventory.xml. This is the ESX host's local database file for VM inventory.
You can also list the registered VMs by running vmware-cmd -l from the / directory on the host.
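To cross-check what VirtualCenter believes is registered on the host against its local vmInventory.xml, a quick PowerCLI sketch (host name is a placeholder) is:

# List the VMs VirtualCenter has registered against this host
Get-VMHost YourESXHostNameHere | Get-VM | Select-Object Name, PowerState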
Good luck.