Reboot issue with MCE error on Dell PowerEdge R6525 running ESXi 7.0 Update 3c

We have new hardware: Dell PowerEdge R6525 servers with AMD EPYC 7713 64-Core processors running ESXi 7.0 Update 3c (Build 19193900), and production VMs were migrated to the 15-host cluster. After a few weeks we started noticing random ESXi reboots, and after further troubleshooting we upgraded all the hardware firmware and the BIOS (2.5.6 up to 2.6.0), but the issue was not fixed.

After monitoring for several weeks, we identified from the DRS rules that the hosts running Linux VMs were affected far more often than the hosts running Windows VMs, so with the help of the vendor we replaced the CPU and also the motherboard on a few hosts, but it didn't help.

All the hosts failed with the error: (Fatal/NonRecoverable) 2 (System Event) 13 Assert + Processor Transition to Non-recoverable

The issue was escalated to Dell's top technical team, and after several months the vendor asked us to upgrade the BIOS to 2.6.6, which finally stopped the reboots.

  1. Error from the ESXi logs, showing a memory error:
    2022-04-03T05:34:42 13 – Processor 1 MEMEFGH VDD PG 0 Assert + Processor Transition to Non-recoverable
  2. After the above error, the server was still running:
    1. 2022-04-03T10:00:00.611Z heartbeat[2308383]: up 5d6h22m15s, 94 VMs; [[2103635 vmx 67108864kB] [2114993 vmx 134084608kB] [2105683 vmx 134090752kB]] []
      The reboot might have happened between this heartbeat and 12 PM UTC.
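To confirm the firmware level and pull these asserts directly from each of the 15 hosts, a quick check like the one below can be run from the ESXi shell (a rough sketch; exact command availability can vary by build):

  • Confirm the BIOS level the firmware push actually left on the host:

# vim-cmd hostsvc/hosthardware | grep -i biosversion

  • Dump the IPMI System Event Log and search for the processor asserts:

# esxcli hardware ipmi sel list | grep -i "non-recoverable"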

Note: We have another environment running the same R6525 hardware with ESXi 6.7 U3 that never faced this issue, and after several rounds of analysis we couldn't find any solid evidence that the issue was caused by the Linux VMs or the applications running on them.

Posted in Dell

NFS 4.1 datastores might become inaccessible after failover or failback operations of storage arrays


When storage array failover or failback operations take place, NFS 4.1 datastores fall into an All-Paths-Down (APD) state. However, after the operations are complete, the datastores might remain in APD state and become inaccessible.


As per VMware, this issue affects hosts older than build 16075168 and is resolved in newer builds. We tested it in our environment and the newer build works fine without any datastore failure.
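For reference, both points can be verified from the ESXi shell after a failover test (a minimal sketch; datastore names are specific to each environment):

  • Check the host build number against 16075168:

# vmware -vl

  • Confirm the NFS 4.1 datastores report as accessible after the failback:

# esxcli storage nfs41 list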

Posted in Storage, Storage\Backup, VMware

VCSA upgrade stuck at 88%


VCSA 7.0 U3b upgrade stuck at 88%

Resolution

The VAMI page was stuck at 88% for more than an hour.

We removed the update_config file and restarted the VAMI, but the update still did not complete.

We then downloaded the FP .iso patch and patched the VCSA via the VAMI successfully.
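For reference, the same FP ISO can also be applied from the appliance shell instead of the VAMI; a minimal sketch, assuming the ISO is attached to the VCSA VM's virtual CD-ROM drive:

  • Stage the patch from the ISO and confirm what was staged:

# software-packages stage --iso --acceptEulas
# software-packages list --staged

  • Install the staged patch (appliance services restart during the install):

# software-packages install --staged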

Posted in VMware

Network issues with VMs running on new Mac Mini ESXi hosts

We have new Mac Mini 2019/2020 models and an older model (2018) running ESXi 6.7 U3, and noticed that on the new Mac Minis the VMs (macOS/Windows/Linux) have issues connecting to the network and downloading files. The only difference between the models is the network card.

We tried enabling jumbo frames on the VMs, and they started working and were able to download files, but we couldn't find the exact cause of the issue, because from the hypervisor itself, and when running native macOS on the same hardware, we don't have any issue.

We are still investigating the issue; the workaround is to enable jumbo frames on the VMs.
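For reference, a rough sketch of the MTU settings involved in the workaround (vSwitch0, vmk0 and ens192 are placeholders for the actual names in our environment), keeping in mind that jumbo frames normally need to match end to end:

  • Raise the MTU on the standard vSwitch and the VMkernel port of the host:

# esxcli network vswitch standard set -v vSwitch0 -m 9000
# esxcli network ip interface set -i vmk0 -m 9000

  • Inside a Linux guest, raise the MTU on the vNIC:

# ip link set dev ens192 mtu 9000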

Posted in MacMini, VMware

Ports required for the AD

Lots of links talk about the ports required for an AD connection. In my environment, the ports below are enabled and clients can be joined to AD with DNS registered; a quick connectivity spot-check is sketched after the list.

TCP_636
TCP_3268
TCP_3269
TCP_88
UDP_88
TCP_53
UDP_53
TCP_445
UDP_445
TCP_25
TCP_135
TCP_5722
UDP_123
TCP_464
UDP_464
UDP_138
TCP_9389
UDP_137
TCP_139
UDP_49152-65535
TCP_49152-65535
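As mentioned above, a quick TCP spot-check from a Linux client before the domain join (dc01.example.com is a placeholder for the DC; the UDP ports and the dynamic 49152-65535 range still need separate verification):

# for p in 25 53 88 135 139 445 464 636 3268 3269 5722 9389; do nc -zv -w 2 dc01.example.com $p; done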


Refer:

https://isc.sans.edu/diary/Cyber+Security+Awareness+Month+-+Day+27+-+Active+Directory+Ports/7468

Posted in AWS, Azure, Cloud

IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1

Below is the issue we faced after upgrading SRM to 8.3.1:

IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1.

IP customization previously worked on these RHEL versions with SRM 6.5.

IP customization works with RHEL 7 VMs, which can utilize SAML tokens for authentication.

It looks like changes between SRM 6.5 and later versions caused the conflict with LDAP on your RHEL 6 machines. Prior to the changes, SRM performed the script transfer using the VIX protocol, which has little to no authentication. This master access method worked from vCenter: SRM would transfer the script through vCenter, then directly to the ESXi host and eventually the VM, without any authentication or tokens involved.

For security reasons, this is obviously a weakness. This has changed and is now enforced: instead, SAML token authentication is used through an SSO Solution User that is created when SRM registers with the PSC/SSO and vCenter. This new method also meant VMware Tools had to be upgraded so it could be part of that SSO process, hence the vgAuth component of the Tools.

This process now impersonates the root account to execute scripts inside the guest OS, with the operation tied directly to an authentication token from SSO.

Also, as you see above, SRM only contacts SSO for authentication; outside of that, SRM itself now transfers the script to the ESXi host and then to the VM, instead of vCenter doing it. This new process forces authentication and uses the temporary SAML token for activities like this. The exact same process applies if you run custom scripts inside the guest OS in your recovery plans.

We have seen cases where LDAP, and in your case OpenLDAP, conflicts with our ability to impersonate on the guest OS. Unfortunately, like any other third-party application or solution that conflicts with our operation, it needs to be addressed from the offending application itself. In this case, SSSD appears to work, as proven by your tests.
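For anyone hitting the same conflict, a simple way to confirm which LDAP client stack a RHEL 6 guest is actually using (SSSD worked in our case; the assumption is that the OpenLDAP-based nslcd/pam_ldap setup was the conflicting one):

  • Check which daemon is running and how account lookups are wired:

# ps -e | egrep 'sssd|nslcd'
# grep -E '^(passwd|group|shadow)' /etc/nsswitch.conf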

Posted in SRM, VMware

Bug in vCenter with hosts running AMD EPYC Zen 3 (Milan) / EPYC 7713.

Recently we moved to AMD EPYC 7713 64-core CPUs with Dell R6525 servers and noticed ESXi hosts showing 100% CPU in vCenter, fluctuating intermittently. When we checked the performance in esxtop it was actually very low, and in our other environment (Supermicro AS-2114GT-DNR with the same AMD EPYC 7713P) we noticed a similar CPU spike.

As per KB https://kb.vmware.com/s/article/85071, it is a cosmetic issue that can be safely ignored, or there is a workaround mentioned in the same KB.
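We cross-checked the vCenter charts against esxtop on the hosts; for reference, a short batch capture like the one below makes it easy to review the real CPU numbers offline:

# esxtop -b -d 5 -n 12 > /tmp/esxtop-cpu-check.csv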

Posted in VMware

Guest OS hangs on GPU-enabled VMs

We have NVIDIA Tesla V100-PCIE-32GB cards running in SYS-1029GQ-TRT servers, and for the past few months we have been facing an issue where VMs hang with a black screen and the only option is to reboot the VM.

Below are the details of the machine:

Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)
Version or Release of Remoting Solution = 7.4.0.13800
Endpoint Client Information = Windows 10 x64 20h2
Number of displays / Display resolution = Single display (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin

NVIDIA analyzed the logs and noticed that the error NVOS status 0x19 is repeated multiple times along with VGPU message 21 and VGPU message 52. This is the known issue 5.11, "11.0 Only: Failure to allocate resources causes VM failures or crashes", which was resolved in vGPU software 11.1. There are also some Xid 43 and Timeout Detection and Recovery (TDR) errors in the logs, but they can be a side effect of the main issue. The recommendation was to install the latest vGPU software, 11.5, on both the ESXi hosts and the guest VMs; the drivers are available at https://ui.licensing.nvidia.com/software. After updating the driver, the issue was fixed.
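For reference, the vGPU manager level on the host can be confirmed before and after the driver update from the ESXi shell:

  • Check the installed NVIDIA host VIB and the runtime driver version:

# esxcli software vib list | grep -i nvidia
# nvidia-smi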

Posted in Dell, HP, NVIDIA GPU, VMware

Root cause for the VPXD Crash issue

In a previous post I explained the vCenter and VPXD crash issue. After a lot of research, VMware support confirmed that the issue is caused by a Likewise failure that is fixed in 6.7 U3m; since VMware patches are cumulative, we can upgrade to the latest patch, which includes all previous fixes.

Here’s the link to the fix documented in the 6.7 U3M release notes: https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-67u3m-release-notes.html#:~:text=If%20the%20identity,in%20this%20release.
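To confirm which build a VC is actually running before and after patching, a quick check from the appliance bash shell is to ask the vpxd binary for its build number (it normally lives under /usr/lib/vmware-vpx/ if not on the path):

# vpxd -v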

Posted in VCSA6.7, VMware

VC 6.7 U3 VPXD crash issue.

Every week we had a VC-down issue. When we logged in via SSH we noticed the VPXD service was stopped, and when we tried to start it, it failed. From the vMon log we noticed a permission error on cloudvm-ram-size.log: the ownership is supposed to be "root cis", but it had changed to "root root". Once we changed the ownership back to "root cis", the VPXD service started and the VC came online. All our other VCs have "root cis", but a VMware engineer confirmed that in their lab a few VCs are "root root" and others are "root cis".
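The manual fix each time was simply resetting the ownership and starting the service again; roughly the following (the log path and service name are as commonly found on VCSA 6.7, so verify them on the appliance first):

  • Check and reset the ownership, then start VPXD:

# ls -l /var/log/vmware/cloudvm/cloudvm-ram-size.log
# chown root:cis /var/log/vmware/cloudvm/cloudvm-ram-size.log
# service-control --start vmware-vpxd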

The same symptoms kept happening every week, and we couldn't find what was causing the permission change, even with VMware support involved. At one point we moved on from the permission issue and started looking for other causes, and a senior VMware engineer got involved, but we couldn't find any concrete root cause for the VPXD crash.

After a few weeks we noticed the VC suddenly became very slow; after entering the login credentials it kept searching for something and we couldn't log in. At one point VPXD stopped, and this time we tried stopping and starting all the services without changing the file permission of cloudvm-ram-size.log. All the services, including VPXD, started without any issue, so our focus shifted from the file permission to the VPXD crash itself.

After we uploaded the logs to VMware, they identified some repeated offline storage IOFilterProviders on multiple vCenters, which they asked us to clean up just to eliminate some unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the sessions they also identified a lot of requests hitting the VC from one particular IP, which turned out to be our dashboard reporting server, and we stopped that service to eliminate it as a factor.

The VC ran fine after we stopped the reporting server from hitting it, but after around one week the VC became very slow again. This time we noticed that the other four VCs connected to the same vsphere.local domain also went slow; they took almost 15 minutes to recover and return to normal, while the problematic VC went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.

While engineering completed their review of the permission changes, we could most certainly tidy up the SSO environment: the vCenter log bundles indicated the presence of external PSCs, and we should clean up any and all residual references to them, as all five vCenters should be running on an embedded PSC. It turned out that even though we had retired the external PSCs a long time ago, the old entries were still present on all the VCs.

VMware recommended either taking powered-off snapshots or, if we were unable to take that downtime, stopping the vmdird service on all five vCenters, taking the five snapshots, and then starting the vmdird service on all five vCenters again.
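A minimal sketch of that sequence on each vCenter (the directory service is normally listed as vmdird by service-control --status; verify the name on your build first):

  • Stop the directory service before taking the snapshot:

# service-control --stop vmdird

  • Take the snapshot, then start the service again:

# service-control --start vmdird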

So, as per VMware, we cleaned up the old PSC entries, and again the VC ran without any issue for about a week but still ended up with the same problem. This time VMware recommended enabling the option below to capture a VPXD dump for more details.

  • Backup the vmon config:

# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak

  • Edit the vmon config:

# vi /etc/vmware/vmware-vmon/config.json

  • Change the following line to true:

"DumpLiveCoreOnApiHealthFail" : false,

  • Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):

# service-control --restart vmware-vmon

Again the same story: vCenter went down again, and based on the VPXD crash dump VMware asked us to measure the latency between the vCenter server and AD (https://kb.vmware.com/s/article/79317) the next time the issue occurred, to check whether the slowness was caused by the domain controller. But things got worse: VPXD started crashing roughly every two days, and the investigation was still ongoing without a fix.
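For reference, a rough way to spot-check that latency from the VCSA bash shell while waiting for the next occurrence (dc01.example.com is a placeholder for a domain controller):

  • Measure round-trip time to the domain controller:

# ping -c 20 dc01.example.com

  • Do a rough TCP connect test against the LDAP port (curl is available on the appliance):

# curl -v telnet://dc01.example.com:389 --max-time 5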

Our vCenter was running the old 6.7.0.44000 build because we have a cloud stack front end for our internal cloud that officially supports only up to vCenter Appliance 6.7 Update 3g (6.7.0.44000), so we did not want to upgrade the VC without vendor approval. Since the situation had become very bad, and we noticed a few VPXD-related fixes in the vCenter Appliance 6.7 Update 3m release, we took the decision to upgrade the VC to 3m (6.7.0.47000).

In the same week VMware released a new security-related version, 6.7.0.48000, so we decided to go with the latest version and upgraded all the VCs. After the upgrade, the VC has been running fine for more than three weeks without any issue, which is our highest recent vCenter uptime. We are still not sure why VPXD was crashing or what caused the cloudvm-ram-size.log permission change, but the upgrade finally made the environment stable and got us back on track.

Posted in Vcenter Appliance, vCSA 6.0, VCSA6.5, VCSA6.7, VMware, VPXD