How It Started
I screwed up a vCenter instance. It is actually pretty easy to screw up this state-of-the-art hypervisor controller from its beautifully designed web UI, using the appealing buttons that have always been there. The process only requires two simple steps:
- Enable vCenter HA
- Replace the machine SSL certificate
The vCenter HA documentation does state “if you want to use custom certificates, you have to remove the vCenter HA configuration”, in the smallest font size possible, but the warning is mentioned nowhere in the documentation on replacing SSL certificates, where it actually belongs. The UI won’t stop you from playing with fire, either.
If you have enough time and a lab environment, give it a try. The vCenter VM will reboot a few times before it stops working completely. It will still spin up, but you won’t be able to log in anymore. You’ll see a very unhelpful error message on the login screen:
An error occurred when processing the metadata during vCenter Single Sign-On setup – Failed to connect to VMware Lookup Service
By the way, don’t bother trying the vSphere Certificate Manager command-line tool to unscrew the situation; it will refuse to do anything once it detects that it is running in a vCenter HA cluster. So, if you don’t have any backup or snapshot to revert to, your vCenter is dead.
Things were a little more complicated in my case: the dead vCenter VM ran on a 3-node hyperconverged cluster with HA, DRS and vSAN. With vCenter down, I now had a problem.
How It’s Going
Luckily, the ESXi hypervisor is largely independent from vCenter, so I could still log in to the individual hypervisors and work with them directly. Now I had to do something to (hopefully) make the situation better.
Preparing
The first obvious thing I did was shut down the old vCenter VMs. They did not work anymore and might have interfered with the recovery process.
Next, I backed up all important data on the cluster. Backing up an ESXi hypervisor is easy: mount some NFS storage on each hypervisor, and manually move/copy the VMs over. vMotion wasn’t available, so everything had to be done by hand, with the VMs shut down.
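A minimal sketch of what that looked like, with hypothetical NFS server, share, and VM names (adjust everything to your environment):

```
# Mount an NFS export as a temporary backup datastore:
esxcli storage nfs add -H nfs01.corp.contoso.com -s /export/backup -v backupDS

# With a VM powered off, copy it over by hand. vmkfstools clones the
# virtual disk properly; plain cp is fine for the small config files.
mkdir /vmfs/volumes/backupDS/myvm
vmkfstools -i /vmfs/volumes/vsanDatastore/myvm/myvm.vmdk \
    /vmfs/volumes/backupDS/myvm/myvm.vmdk
cp /vmfs/volumes/vsanDatastore/myvm/myvm.vmx /vmfs/volumes/backupDS/myvm/
```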
Then I shut down as many VMs as I could. While it might be possible to rebuild the cluster with some VMs still running, I recommend against it.
Prepare a vCenter installer ISO on your workstation, and let’s get into the recovery process.
The First (Unsuccessful) Attempt
Being rather unfamiliar with the new vSphere 7.0, my initial strategy was to reinstall vCenter directly onto the vSAN storage, take over the hosts, rebuild the distributed switch by hand, and simply re-configure the cluster. This did not work: while adding the first host, vCenter reported “Found host(s) esxi02.corp.contoso.com, esxi03.corp.contoso.com participating in the vSAN service which is not a member of this host’s vCenter cluster”, and after a few seconds, vCenter froze. Later investigation showed that vCenter had detected vSAN configured on the host and overwrote it with a single-node vSAN configuration, breaking the very storage it was running on.
Now I had two problems: a dead vCenter, and a 3-node vSAN cluster in a split-brain situation.
The Second (Successful) Attempt
Knowing that vSAN never automatically deletes inaccessible/broken objects, I was confident that all my data was still there; it was just the vSAN configuration that needed to be fixed to at least keep the storage running. After some searching on the Internet, I found out that you can actually manage the entire vSAN configuration from the ESXi hypervisor hosts themselves! The official documentation on the esxcli vsan subcommand is not very helpful, but it was enough to put me on the right track.
I enabled SSH on all the hosts, and issued this command to every host:
```
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
```
This essentially told the vSAN agent running on every host to ignore everything sent by any vCenter. Now that the “manual transmission” mode was engaged, I started recovering the vSAN.
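To double-check that the flag took effect on a host, you can read it back:

```
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
```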
First let’s confirm the status:
```
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
   Enabled: true
   Current Local Time: 2020-10-29T07:05:18Z
   Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Sub-Cluster Member HostNames: esxi01.corp.contoso.com
   Sub-Cluster Membership UUID: 665dbc18-5bde-4cb6-a510-7c5185c78f3d
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: dadf3e7c-8162-4815-9d02-08af4d8c4c7b 2 2020-10-29T06:29:11.652

[root@esxi03:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
   Enabled: true
   Current Local Time: 2020-10-29T07:09:48Z
   Local Node UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
   Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
   Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
   Sub-Cluster Membership Entry Revision: 2
   Sub-Cluster Member Count: 2
   Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f
   Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com
   Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: dd0af2e3-d7e0-4407-9a50-d87be61513b3 9 2020-10-22T08:59:00.661
```
We indeed had a split brain: esxi01 considered itself the master of an imaginary one-node cluster. So, kick esxi01 out of that cluster (a very slow operation; have some patience), and re-join it using the correct sub-cluster UUID taken from the other hosts’ config:
```
[root@esxi01:~] esxcli vsan cluster leave
[root@esxi01:~] esxcli vsan cluster join -u 3a02d572-728d-482b-a94d-2245a6ec99d1
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
   Enabled: true
   Current Local Time: 2020-10-29T07:09:55Z
   Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Sub-Cluster Member HostNames: esxi01.corp.contoso.com
   Sub-Cluster Membership UUID: ab6a9a5f-2401-89af-99aa-0c42a171e24e
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: None 0 0.0
```
A vCenter-configured vSAN cluster runs in unicast mode (i.e. peer discovery depends on the IP list pushed by the control plane), so we also need to synchronize the cluster’s IP address list on every host. Verify that the VMkernel adapter for vSAN is set up on esxi01:
```
[root@esxi01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk2
   IP Protocol: IP
   Interface UUID: 699fe1e6-eaba-49db-9d04-8859ed2b066f
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan
```
Since esxi02 and esxi03 already know each other, we can collect the unicast agent lists from those two hosts…
```
[root@esxi02:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  -----------------------------------------------------------  --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2          0  true              192.168.1.201  12321              73:F4:93:D8:D8:2A:C0:D3:4F:A6:DF:4D:3D:BE:34:8C:15:D9:45:52  3a02d572-728d-482b-a94d-2245a6ec99d1
9f7326ad-f815-45b1-a809-ece25fddc7ec          0  true              192.168.1.215  12321              05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6  3a02d572-728d-482b-a94d-2245a6ec99d1

[root@esxi03:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  -----------------------------------------------------------  --------------
9f7326ad-f815-45b1-a809-ece25fddc7ec          0  true              192.168.1.215  12321              05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6  3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f          0  true              192.168.1.160  12321              6D:E4:62:CA:FB:17:96:41:97:F4:22:B9:8F:D8:B2:5E:93:0F:79:0D  3a02d572-728d-482b-a94d-2245a6ec99d1
```
then play them back onto esxi01 (if you have a vSAN witness appliance, you will need to adjust the arguments slightly):
```
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.201 -U true -u 67874ba3-8fd5-463f-80fb-6a82910c5ff2 -t node
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.160 -U true -u 04e3bd93-2846-4474-bae7-e16b602e316f -t node
[root@esxi01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name  Cert Thumbprint  SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  ---------------  --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2          0  true              192.168.1.201  12321                               3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f          0  true              192.168.1.160  12321                               3a02d572-728d-482b-a94d-2245a6ec99d1
```
Since esxi01’s IP address has not changed, no changes are needed on the other two hosts. Let’s verify that vSAN is up and running again:
```
[root@esxi01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2020-10-29T07:15:40Z
   Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Local Node Type: NORMAL
   Local Node State: AGENT
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
   Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
   Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
   Sub-Cluster Membership Entry Revision: 3
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f, 9f7326ad-f815-45b1-a809-ece25fddc7ec
   Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com, esxi01.corp.contoso.com
   Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 9f7326ad-f815-45b1-a809-ece25fddc7ec 2 2020-10-29T07:15:25.0
```
Yay!
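For extra confidence, the embedded vSAN health checks can also be run straight from a host, without any vCenter. A sketch, assuming an ESXi build recent enough to ship the on-host health service:

```
# Run the on-host vSAN health checks and list the results:
esxcli vsan health cluster list
```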
The rest of the steps are pretty straightforward. The key takeaway: to join a host to a cluster, it must either be in maintenance mode (i.e. all VMs powered off) or have only the vCenter VM running on it. All the other steps exist to solve this chicken-and-egg problem.
- Shut down all the VMs running on the hosts if you haven’t already done so
- Find the node with the oldest CPU (assume it is esxi01; starting on the oldest CPU keeps EVC from blocking vMotion later), and if possible, connect a temporary non-vSAN datastore (NFS or a local storage device)
- Install vCenter onto esxi01 using the temporary datastore
- Set up vCenter (networking, admin user, certificate)
- Add esxi01 to the vCenter and put it in a new cluster; you can use the cluster quickstart wizard, but do not let it configure networking for you
- Enable VMware EVC on the new cluster
- If you have a backup for distributed switch config, restore it; otherwise configure a new distributed switch
- Add another host (say, esxi02) to the vCenter; do not add it to a cluster yet
- Add esxi02 to the distributed switch and migrate all adapters
- vMotion the vCenter VM to esxi02
- Add esxi01 and esxi03 to the distributed switch and migrate all adapters
- Go to the web portals of esxi02 and esxi03, put them into maintenance mode, and set the vSAN migration mode to “no data migration”; see the CLI sketch after this list (do not use vCenter to put them into maintenance mode, as that will cause vSAN to evict data; also, this will temporarily block all requests to the vSAN datastore, so make sure nothing is running on it)
- Add esxi02 and esxi03 to the cluster and configure the cluster in the quickstart wizard
- If this caused vSAN to move some data back and forth, wait for the migration to finish
- Verify that all objects in vSAN are readable, and try restarting the VMs
- vMotion the vCenter back onto the vSAN datastore
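For the maintenance mode step above, the host web UI works fine; over SSH, the equivalent should look something like this (a sketch; run it on esxi02 and esxi03):

```
# Enter maintenance mode without evacuating vSAN data; this matches the
# "No data migration" option in the host UI:
esxcli system maintenanceMode set -e true -m noAction
```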
Now we have a new vCenter server and a new cluster good to go.
Cleaning Up
If you still want to configure vSAN from vCenter later, first execute the following command on every ESXi host:
```
esxcfg-advcfg -d /VSAN/IgnoreClusterMemberListUpdates
```
This allows the vSAN agent to accept configuration from vCenter again. Then let vCenter synchronize once with all the hosts: Cluster -> Monitor -> Skyline Health -> vCenter state is authoritative -> click “UPDATE ESXI CONFIGURATION”.
If you have custom storage policies, you can restore them with the following commands in the Ruby vSphere Console (RVC) on the vCenter appliance:
```
Command> rvc administrator@vsphere.local@localhost
vsan.recover_spbm /localhost/<datacenter_name>/computers/<cluster_name>
```
The vSAN default policy will be created automatically.
If you have any inaccessible objects, SSH into one of the hosts containing the object, then delete it manually:
```
/usr/lib/vmware/osfs/bin/objtool delete -f -v 10 -u <object_uuid>
```
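If you are not sure which objects are inaccessible in the first place, newer ESXi builds can enumerate them from the host as well; a sketch, assuming the esxcli vsan debug namespace is available (6.7 and later, to my knowledge):

```
# Summarize object health across the cluster:
esxcli vsan debug object health summary get

# Dump per-object details; note the UUIDs of objects whose health is
# not "healthy" and feed them to objtool:
esxcli vsan debug object list
```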
The following things will have to be rebuilt by hand in the new vCenter:
- users, groups, permissions
- content libraries
- host profiles
- HA & DRS
- VM rules
If you have vSAN file services configured, you might need to re-enable them from vCenter. You will need to re-upload the OVAs, and you won’t be able to change the configuration. Note that vSAN file services in 7.0U1 is extremely buggy and locked itself up on my cluster (I couldn’t enable, disable, configure, or use it), so I currently do not recommend using it in production.
If you get “Unable to connect to MKS” errors when connecting to VM consoles on the new vCenter, see “Unable to connect to MKS” error in vSphere Web Client (2115126).
Final Thoughts
One thing I like about vSphere is its ability to continue functioning without a centralized control plane. HA, multiple-access datastores, and vSAN are all designed around this basic assumption, and it has saved me many times. vCenter, on the other hand, is a fragile thing, and vCenter 7.0, with a lot of legacy Java components rewritten in Python, is more fragile than ever.
Always export and back up your distributed switch config, even if you have automated vCenter backups. It will save you a lot of time if you ever have to set up a new vCenter. If you have vSAN file services configured, failing to restore the old distributed switch after a vCenter rebuild might render the entire service inaccessible. (If you can’t re-enable it in the vSphere UI, try calling the vCenter API vim.vsan.ReconfigSpec with a different port group; there is a chance it helps, but your mileage may vary.)
References
- The Resiliency of vSAN – Recovering my 2-Node Direct Connect While Preserving vSAN Datastore
- Configure 2-Node VSAN on ESXi Free Using CLI Without VCenter
- VMware vSAN cache disk failed and how to recover from it
- vSAN question: Restore VCSA on vSAN
- Administering VMware vSAN (PDF)
- Purge inaccessible objects in VMware vSAN
- Fixing these dratted Unknown vSAN Objects
- Fix orphaned vSAN objects
- VMware VSAN delete/purge inaccessible objects
- VMware®Ruby vSphere Console Command Reference for Virtual SAN (PDF)