One of our internal cloud environments, which runs ESXi on SuperMicro (SYS-1027GR) hosts, failed with a PSOD reporting a fatal (unrecoverable) MCE on one of the physical CPUs. It happened frequently on almost all the ESXi hosts in the cluster, but not always right after a restart.
Sample from the log:
cpu2:33455)@BlueScreen: Machine Check Exception: Fatal (unrecoverable) MCE on PCPU2 in world 33455:vmnic1-pollW System has encountered a Hardware Error – Please contact the hardware vendor
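When several hosts in a cluster are hitting the same PSOD, it can help to pull the affected PCPU and world out of each log line to see whether the crashes share a pattern. Below is a minimal Python sketch that does this for a line shaped like the sample above. The regular expression is modeled on that one sample only, so other ESXi builds may word the message differently; the function name parse_mce_line is just an illustration.

```python
import re

# Pattern modeled on the single sample line above (an assumption):
# other ESXi versions may format the message differently.
MCE_PATTERN = re.compile(
    r"Machine Check Exception: Fatal \(unrecoverable\) MCE "
    r"on PCPU(?P<pcpu>\d+) in world (?P<world_id>\d+):(?P<world_name>\S+)"
)


def parse_mce_line(line):
    """Return the PCPU number, world ID and world name, or None if no match."""
    match = MCE_PATTERN.search(line)
    if match is None:
        return None
    return {
        "pcpu": int(match.group("pcpu")),
        "world_id": int(match.group("world_id")),
        "world_name": match.group("world_name"),
    }


sample = (
    "cpu2:33455)@BlueScreen: Machine Check Exception: Fatal (unrecoverable) MCE "
    "on PCPU2 in world 33455:vmnic1-pollW System has encountered a Hardware Error"
)
print(parse_mce_line(sample))
```

For the sample line this reports PCPU 2 and world 33455 (vmnic1-pollW), which is the NIC polling world.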
We raised support cases with SuperMicro and VMware. After a long investigation, the VMware engineer identified it as a known bug on SuperMicro servers that shows up when any VM runs in a nested manner. The issue is due to an Intel erratum, so the long-term solution is for Intel to acknowledge it and release a microcode/BIOS fix. It seems that nested virtualization combined with PCI passthrough triggers the erratum on the Intel CPUs and makes the hosts crash. The related bug number mentioned by VMware is 1603071.
As a temporary fix for this PSOD, we have disabled the Hardware Virtualization option on the VMs. We are working with SuperMicro and VMware on a workaround and a long-term solution.
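To find which VMs still have the option turned on, one approach is to check each VM's .vmx file. The sketch below assumes that the Hardware Virtualization option is stored as the vhv.enable key in the .vmx file; that mapping is an assumption here, not something confirmed in the support case. The function name and the sample .vmx text are made up for illustration.

```python
import re

# Assumption: the "Hardware Virtualization" VM option is stored in the
# .vmx file as a line like: vhv.enable = "TRUE"
VHV_PATTERN = re.compile(
    r'^\s*vhv\.enable\s*=\s*"(?P<value>[^"]*)"\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def nested_hv_enabled(vmx_text):
    """Return True if the last vhv.enable entry in the .vmx text is TRUE."""
    values = VHV_PATTERN.findall(vmx_text)
    if not values:
        # Key absent: treat the option as disabled.
        return False
    return values[-1].strip().upper() == "TRUE"


# Hypothetical .vmx content, for illustration only.
example_vmx = 'displayName = "lab-vm"\nvhv.enable = "TRUE"\n'
print(nested_hv_enabled(example_vmx))
```

Running this over the text of every .vmx file on a datastore gives a quick list of VMs that still need the option switched off.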