vSAN 7.0U1 Cluster Rebuild: A Firsthand Experience

How It Started

I screwed up a vCenter instance. It is actually pretty easy to screw up the state-of-the-art hypervisor controller from its beautifully designed web UI, using the appealing buttons that have always been there. The process requires only two simple steps:

  1. Enable vCenter HA
  2. Replace the machine SSL certificate

The vCenter HA documentation does state “if you want to use custom certificates, you have to remove the vCenter HA configuration” in the smallest font size possible, but the warning appears nowhere in the documentation on replacing SSL certificates, where it actually belongs. The UI won’t stop you from playing with fire, either.

If you have enough time and a lab environment, give it a try. The vCenter VM will reboot a few times before it completely stops working. It will still spin up, but you won’t be able to log in anymore. You’ll see a very unhelpful error message on the login screen:

An error occurred when processing the metadata during vCenter Single Sign-On setup – Failed to connect to VMware Lookup Service

By the way, don’t bother trying the vSphere Certificate Manager command-line tool to unscrew the situation; that tool will refuse to do anything if it detects that it is running in a vCenter HA cluster. So, if you don’t have any backup or snapshot to revert to, your vCenter is dead.

Things were a little more complicated in my case: the dead vCenter VM ran on a 3-node hyperconverged cluster with HA, DRS, and vSAN. With vCenter down, I had a problem.

How It’s Going

Luckily, the ESXi hypervisor is largely independent of vCenter, so I could still log in to the individual hypervisors and work on them directly. Now I had to do something to (hopefully) make the situation better.

Preparing

The first obvious thing I did was to shut down the old vCenter VMs. They did not work anymore and might have interfered with the recovery process.

The next thing I did was back up all the important data on the cluster. Backing up an ESXi hypervisor is easy: mount some NFS storage on each hypervisor, and manually move/copy the VMs over. vMotion wasn’t available, so everything had to be done by hand while the VMs were shut down.

Then I shut down as many VMs as I could. Although it may be possible to rebuild the cluster while keeping some VMs running, I recommend against it.

Prepare a vCenter installer ISO on your workstation, and let’s get into the recovery process.

The First (Unsuccessful) Attempt

Being rather unfamiliar with the new vSphere 7.0, I initially planned to simply reinstall vCenter directly onto the vSAN storage, take over the hosts, rebuild the distributed switch by hand, and re-configure the cluster. The process did not work: while adding the first host, vCenter reported “Found host(s) esxi02.corp.contoso.com, esxi03.corp.contoso.com participating in the vSAN service which is not a member of this host’s vCenter cluster”, and after a few seconds, vCenter froze. Later investigation showed that vCenter had detected vSAN configured on the host and overwrote it with a single-node vSAN configuration, breaking the very storage it was running on.

Now I had two problems: a dead vCenter, and a 3-node vSAN cluster in a split-brain situation.

The Second (Successful) Attempt

Knowing that vSAN won’t automatically delete any inaccessible/broken object, I was confident that all my data was still there; it was just the vSAN configuration that needed to be fixed to at least keep the storage running. After some searching on the Internet, I found out that you can actually manage all vSAN configuration on the ESXi hypervisor host itself! There is some not-very-helpful official documentation on the esxcli vsan subcommand, but it was enough to get me on the right track.

I enabled SSH on all the hosts, and issued this command on every one of them:
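
The knob for this is the /VSAN/IgnoreClusterMemberListUpdates advanced setting; the call looks something like this:

    # Tell the local vSAN agent to ignore cluster member list updates pushed by vCenter
    esxcli system settings advanced set -o /VSAN/IgnoreClusterMemberListUpdates -i 1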

This essentially told the vSAN agent running on every host to ignore everything sent by any vCenter. Now that “manual transmission” mode was engaged, I started to recover the vSAN.

First let’s confirm the status:
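A sketch of the check, run on each host (compare the Sub-Cluster UUID and member count across esxi01, esxi02, and esxi03):

    # Shows the local node UUID, sub-cluster UUID, and current membership
    esxcli vsan cluster get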

We indeed had a split brain. Next, kick esxi01 out of the imaginary one-node cluster (a very slow process, so have some patience), and re-join it using the correct sub-cluster UUID from the other hosts’ config:
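
A sketch of the two calls on esxi01, where <sub-cluster-uuid> is a placeholder for the Sub-Cluster UUID reported by esxi02/esxi03:

    # Leave the bogus single-node cluster (slow; be patient)
    esxcli vsan cluster leave

    # Re-join the real cluster by its sub-cluster UUID
    esxcli vsan cluster join -u <sub-cluster-uuid>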

A vCenter-configured vSAN cluster runs in unicast mode (i.e. peer discovery depends on the IP list pushed by the control plane), so we also need to synchronize the cluster’s IP address list on every host. First, verify that the VMkernel adapter for vSAN is set up on esxi01:
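
A quick way to check this from the shell:

    # Lists the VMkernel interfaces tagged for vSAN traffic on this host
    esxcli vsan network list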

If you don’t see the “vsan” traffic type in the output, reconfigure your VMkernel adapter. Since esxi02 and esxi03 already know each other, we can collect the unicast agent list from those two hosts…
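
A sketch, run on esxi02 and esxi03:

    # Each entry lists a peer's node UUID, vSAN IP address, and port
    esxcli vsan cluster unicastagent list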

then play those entries back onto esxi01 (if you have vSAN witness appliances, you will need to change the arguments slightly here):
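
A sketch of the playback on esxi01, with placeholder values (for a witness node the -t argument changes accordingly):

    # Add esxi02 as a unicast peer (node UUID and vSAN IP taken from esxi02's own output)
    esxcli vsan cluster unicastagent add -t node -u <esxi02-node-uuid> -U true -a <esxi02-vsan-ip> -p 12321

    # Add esxi03 the same way
    esxcli vsan cluster unicastagent add -t node -u <esxi03-node-uuid> -U true -a <esxi03-vsan-ip> -p 12321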

As esxi01’s IP addresses have not changed, no changes are needed on the other two hosts. Let’s verify that vSAN is up and running again.
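
A sketch of the sanity check, run on any host:

    # Sub-Cluster Member Count should be back to 3, with all three node UUIDs listed
    esxcli vsan cluster get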

Yay!

The rest of the steps are pretty straightforward. The key takeaway here is: to join a host to a cluster, it must either be in maintenance mode (i.e. all VMs shut off) or have only the vCenter VM running on it. All the other steps exist to solve that chicken-and-egg problem.

  1. Shut down all the VMs running on the hosts if you haven’t already done so
  2. Find the node with the oldest CPU (assume it is esxi01), and if possible, connect a temporary non-vSAN datastore (NFS or a local storage device)
  3. Install vCenter onto esxi01 using the temporary datastore
  4. Set up vCenter (networking, admin user, certificate)
  5. Add esxi01 to the vCenter and put it in a new cluster; you can use the cluster quickstart wizard, but do not let it configure networking for you
  6. Enable VMware EVC on the new cluster
  7. If you have a backup for distributed switch config, restore it; otherwise configure a new distributed switch
  8. Add another host (say, esxi02) to the vCenter, but do not add it to a cluster yet
  9. Add esxi02 to the distributed switch and migrate all adapters
  10. vMotion the vCenter VM to esxi02
  11. Add esxi01 and esxi03 to the distributed switch and migrate all adapters
  12. Go to the web portal of esxi02 and esxi03, put them into maintenance mode, and set the vSAN migration mode to “no data migration” (do not use vCenter to put them into maintenance mode, as that will cause vSAN to evict data; also, this will temporarily block all requests to the vSAN datastore, so make sure nothing is running on it)
  13. Add esxi02 and esxi03 to the cluster and configure the cluster in the quickstart wizard
  14. If this caused vSAN to move some data back and forth, wait for the migration to finish
  15. Verify that all objects in vSAN are readable, and try restarting the VMs
  16. vMotion the vCenter back onto the vSAN datastore

Now we have a new vCenter server and a new cluster good to go.

Cleaning Up

If you still want to configure vSAN from vCenter later, first execute the following command on every ESXi host:
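
Assuming the same advanced option as before, this simply flips it back:

    # Allow the vSAN agent to accept cluster member list updates from vCenter again
    esxcli system settings advanced set -o /VSAN/IgnoreClusterMemberListUpdates -i 0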

This allows the vSAN agent to receive further configuration from vCenter. Then let vCenter synchronize once with all the hosts: Cluster -> Monitor -> Skyline Health -> vCenter state is authoritative -> click on “UPDATE ESXI CONFIGURATION”.

If you have custom storage policies, you can restore them using the following command in the vCenter Ruby console (RVC):
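
A likely candidate is RVC’s vsan.recover_spbm, which rebuilds SPBM profiles from the metadata stored with the vSAN objects; a sketch with a placeholder cluster path:

    # Inside rvc, connected to the new vCenter; point it at the rebuilt cluster
    vsan.recover_spbm <path-to-cluster>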

The vSAN default policy will be created automatically.

If you have any inaccessible object, SSH into one of the hosts containing that object, then delete it manually:
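
A sketch using the on-host objtool utility (the UUID is a placeholder; make sure it really is the broken object before deleting):

    # Delete a vSAN object by UUID, directly on a host holding one of its components
    /usr/lib/vmware/osfs/bin/objtool delete -u <object-uuid> -f -v 10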

The following things will need to be rebuilt by hand in the new vCenter:

  • users, groups, permissions
  • content libraries
  • host profiles
  • HA & DRS
  • VM rules

If you have vSAN File Services configured, you might need to re-enable it from vCenter. You will need to re-upload the OVAs, and you won’t be able to change the configuration. Note that vSAN File Services in 7.0U1 is extremely buggy and locked itself up on my cluster (I couldn’t enable, disable, configure, or use it), so I currently do not recommend using it in production.

If you get an “Unable to connect to MKS” error when connecting to VM consoles on the new vCenter, see VMware KB 2115126: “Unable to connect to MKS” error in vSphere Web Client.

Final Thoughts

One thing I like about vSphere is its ability to continue functioning without a centralized control plane. HA, multiple-access datastores, and vSAN are all designed around this basic assumption, and this has saved me many times. On the other hand, vCenter is a fragile thing, and vCenter 7.0, with a lot of legacy Java components being rewritten in Python, is more fragile than ever.

Always export and back up your distributed switch config, even if you have automated backups for vCenter. This will save you a lot of time in case you must set up a new vCenter. If you have vSAN file services configured, failing to restore the old distributed switch after a vCenter rebuild might render the entire service inaccessible. (If you can’t re-enable it in the vSphere UI, try calling the vCenter API vim.vsan.ReconfigSpec with a different port group; there is a chance it works, but your mileage may vary.)
