Random reboots with MCE errors on Dell PowerEdge R6525 running ESXi 7.0 Update 3c

We deployed new hardware, Dell PowerEdge R6525 servers with AMD EPYC 7713 64-core processors, running ESXi 7.0 Update 3c (build 19193900), and migrated the PRD VMs to the 15-host cluster. After a few weeks, ESXi hosts started rebooting at random. After further troubleshooting, we upgraded all hardware firmware and the BIOS (from 2.5.6 up to 2.6.0), but that did not fix the issue.

After monitoring for several weeks, we found that the hosts running Linux VMs (placed on certain hosts by a DRS rule) were affected more than the hosts running Windows VMs. With the vendor's help, we replaced the CPU and the motherboard on a few hosts, but that did not help either.

All the hosts failed with the error: (Fatal/NonRecoverable) 2 (System Event) 13 Assert + Processor Transition to Non-recoverable

The issue was escalated to Dell's top technical team. After several months, the vendor asked us to upgrade the BIOS to 2.6.6, and that finally stopped the reboots.

  1. Error from the ESX logs, showing a memory error:
    2022-04-03T05:34:42 13 – Processor 1 MEMEFGH VDD PG 0 Assert + Processor Transition to Non-recoverable
  2. After the above error, the server kept running until 12 PM UTC:
    1. 2022-04-03T10:00:00.611Z heartbeat[2308383]: up 5d6h22m15s, 94 VMs; [[2103635 vmx 67108864kB] [2114993 vmx 134084608kB] [2105683 vmx 134090752kB]] []
      The reboot may have happened within this time window.
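
To narrow down when a host rebooted, the uptime in the heartbeat lines can be turned into an approximate boot time, and a drop in uptime between two consecutive heartbeats brackets the reboot. The following is only a minimal Python sketch, assuming heartbeat lines have the same shape as the one quoted above; the function names `parse_heartbeat` and `find_reboot_windows` are made up for illustration and are not part of any VMware tool.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern based only on the single heartbeat line quoted in this post;
# other ESXi builds may format the line differently (assumption).
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\S+)\s+heartbeat\[\d+\]:\s+up\s+(?P<up>(?:\d+[dhms])+),\s+(?P<vms>\d+)\s+VMs"
)
UPTIME_PART_RE = re.compile(r"(\d+)([dhms])")
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_heartbeat(line):
    """Return timestamp, uptime, VM count and estimated boot time, or None."""
    match = HEARTBEAT_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(
        match.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    seconds = sum(
        int(value) * UNIT_SECONDS[unit]
        for value, unit in UPTIME_PART_RE.findall(match.group("up"))
    )
    uptime = timedelta(seconds=seconds)
    return {
        "timestamp": timestamp,
        "uptime": uptime,
        "vms": int(match.group("vms")),
        "booted_at": timestamp - uptime,
    }


def find_reboot_windows(lines):
    """Return (last_heartbeat_before, first_heartbeat_after) timestamp pairs.

    A reboot is inferred whenever the reported uptime goes down between
    two consecutive heartbeat lines.
    """
    windows = []
    previous = None
    for line in lines:
        heartbeat = parse_heartbeat(line)
        if heartbeat is None:
            continue  # not a heartbeat line
        if previous is not None and heartbeat["uptime"] < previous["uptime"]:
            windows.append((previous["timestamp"], heartbeat["timestamp"]))
        previous = heartbeat
    return windows


if __name__ == "__main__":
    sample = (
        "2022-04-03T10:00:00.611Z heartbeat[2308383]: up 5d6h22m15s, 94 VMs; "
        "[[2103635 vmx 67108864kB] [2114993 vmx 134084608kB] "
        "[2105683 vmx 134090752kB]] []"
    )
    info = parse_heartbeat(sample)
    print(info["booted_at"].isoformat())  # 2022-03-29T03:37:45.611000+00:00
```

For the heartbeat quoted above, this puts the previous boot at about 03:37 UTC on 2022-03-29, which confirms the host had not yet rebooted when the processor error was logged at 05:34 on 2022-04-03. Feeding all heartbeat lines from before and after the incident to `find_reboot_windows` would give the interval in which the reboot occurred.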

Note: We have another environment running the same R6525 hardware with ESXi 6.7 U3, and it has not had this issue. After several rounds of analysis, we found no solid evidence that the issue was caused by the Linux VMs or the applications running on them.

