NFS 4.1 brought a lot of new features compared to its predecessor NFS 3: multipathing, “reliable” transmission based on TCP, and better authentication support. ESXi comes with native support for NFS 4.1. But what if I told you that, under a very specific circumstance, NFS 4.1 can get all your VMs forcefully powered off and cause data loss?
TL;DR: Avoid NFS 4.1 and stick with NFS 3 on your vSphere cluster, especially if your storage backend has relatively poor performance.
How It Started
I screwed up a vCenter instance. Actually, it is pretty easy to screw up the state-of-the-art hypervisor controller from its beautifully designed web UI, using the appealing buttons that have always been there. The process only requires 2 simple steps:
So you have a handful of brand new ESXi servers, and want VMs to automagically move here and there based on host availability and resource usage. vCenter has you covered with DRS and HA, but obviously you need to put all the hosts into a cluster for these things to work. What you might not know is that there are 3 ways of creating a cluster, which differ in certain ways, and you will regret it if you choose the wrong one. Trust me, I learned it the hard way.
Note: we are using ESXi 7.0 and vCenter 7.0 here.