In our organization, the use of Azure services is restricted, and we have to work with different teams to get resource types enabled in policy before we can deploy resources.
Recently I was working on adding a non-Azure VM to Azure Arc, and it was failing with an error:
“FATAL RequestCorrelationId:281e537a-90cfa3a4003 Message: Resource ‘12345’ was disallowed by policy.” While troubleshooting, I found this link, which helped me locate the blocking policy, and the policy search is very easy to use.
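If you hit the same error, one way to see which assignment denied the request is to query the policy events. Below is a minimal sketch using the Az.PolicyInsights PowerShell module; the resource group name is a placeholder, not our actual value.
# Minimal sketch; assumes the Az.PolicyInsights module and an existing Connect-AzAccount session.
Get-AzPolicyEvent -Filter "resourceGroup eq 'rg-arc-servers' and complianceState eq 'NonCompliant'" |
    Select-Object Timestamp, PolicyAssignmentName, PolicyDefinitionName, PolicyDefinitionAction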
Recently we noticed that in a few datastores the search operation takes more time than expected, and in our testing we identified that NFSv3 performs noticeably better than NFSv4.
Our first test was to list files from the host shell. We mounted the same storage to the same host as both NFSv4 and NFSv3 and ran the following command from the host shell against both datastores: time ls -lahR | wc -l. On NFSv4 the command takes 1 minute and 30 seconds to finish; on NFSv3 it takes only 15 seconds.
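For reference, this is roughly how the same export can be mounted twice for such a comparison, sketched in PowerCLI; the host, NFS server, and export names below are placeholders, not our actual values.
# Sketch only; assumes an existing PowerCLI session (Connect-VIServer).
$esx = Get-VMHost 'esx01.example.com'
New-Datastore -Nfs -VMHost $esx -Name 'ds-nfs3'  -NfsHost 'nas01' -Path '/vol/test'
New-Datastore -Nfs -VMHost $esx -Name 'ds-nfs41' -NfsHost 'nas01' -Path '/vol/test' -FileSystemVersion '4.1'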
Then we tried the same comparison using PowerShell, searching the files as shown below.
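The exact script was not preserved here, but it was essentially a recursive listing plus a count, along these lines; the drive path is a placeholder for wherever the datastore-backed disk is mounted in the guest.
# Sketch of the PowerShell test; 'X:\' is a placeholder path.
Measure-Command {
    Get-ChildItem -Path 'X:\' -Recurse -Force | Measure-Object
}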
The data transfer rate is pretty much the same on both versions; the slowness is specific to the file list/search operations on NFSv4.
Based on the results, we involved the vendors to investigate the issue. NetApp conducted a thorough investigation and determined there were no performance issues on the array, but VMware acknowledged there was a problem with NFSv4.1. Below is the reply from VMware:
“Based on the analysis, Engineering team has identified that the issue related to slow search on NFSv4.1 is caused because the NFSv4.1 does not support Directory Name Lookup Cache (DNLC) yet. However, for NFSv3, most of the LOOKUP calls are served from cache, which avoids sending a LOOKUP instruction to the NFS server. The VMware engineering is working to add this feature for NFSv4.1 however we do not have a version confirmation where this is expected to be included.”
We applied the latest patches, but the issue still exists; hopefully it will be fixed in a future patch.
We have new hardware, Dell PowerEdge R6525 with AMD EPYC 7713 64-core processors, running ESXi 7.0 Update 3c (build 19193900), and the production VMs were migrated to this 15-host cluster. After a few weeks, we started noticing random ESXi reboots, and after further troubleshooting we upgraded all the hardware firmware and the BIOS (2.5.6 up to 2.6.0), but that didn't fix the issue.
After monitoring for several weeks, we identified that the hosts where our DRS rules place the Linux VMs were the most affected, compared to the hosts running Windows VMs, so with the vendor's help we replaced the CPU and the motherboard on a few hosts, but it didn't help.
All the hosts failed with the error: (Fatal/NonRecoverable) 2 (System Event) 13 Assert + Processor Transition to Non-recoverable
The issue was escalated to Dell's top technical team, and after several months the vendor asked us to upgrade the BIOS to 2.6.6, which finally stopped the reboots.
After the above error, the server kept running until 12 PM UTC:
2022-04-03T10:00:00.611Z heartbeat: up 5d6h22m15s, 94 VMs; [[2103635 vmx 67108864kB] [2114993 vmx 134084608kB] [2105683 vmx 134090752kB]]
The reboot might have happened sometime after this heartbeat.
Note: We have another environment running the same R6525 hardware with ESXi 6.7 U3 that didn't face any issue, and after several analyses we couldn't find any solid evidence that the issue was caused by the Linux VMs or the applications running on them.
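To keep track of which hosts are on which BIOS during an exercise like this, a quick PowerCLI report along these lines is handy; it is a sketch that assumes an existing PowerCLI session.
# Sketch: report ESXi version/build and BIOS version per host via the vSphere API view.
Get-VMHost | Select-Object Name, Version, Build,
    @{Name='BIOSVersion'; Expression={ $_.ExtensionData.Hardware.BiosInfo.BiosVersion }}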
NFS 4.1 datastores might become inaccessible after failover or failback operations of storage arrays.
When storage array failover or failback operations take place, NFS 4.1 datastores fall into an All-Paths-Down (APD) state. However, after the operations are complete, the datastores might remain in APD state and become inaccessible.
As per VMware, this issue happens on hosts older than build 16075168 and is resolved in newer versions. We tested this in our environment, and the newer version works fine without any datastore failures.
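A quick way to flag hosts still running a build older than the one VMware referenced is sketched below; it assumes an existing PowerCLI session.
# Sketch: list hosts below build 16075168, which VMware cited as the fix boundary.
Get-VMHost | Where-Object { [long]$_.Build -lt 16075168 } |
    Select-Object Name, Version, Build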
We have new Mac mini 2019/2020 models and an older model (2018) running ESXi 6.7 U3, and we noticed that on the new Mac minis the VMs (macOS/Windows/Linux) have issues connecting to the network and downloading files. The only difference is the network card, which is a different model.
We tried enabling jumbo frames on the VMs, and they started working and were able to download files, but we couldn't find the exact cause of the issue, because from the hypervisor itself, or with macOS running on the bare hardware, we don't have any issue.
We are still investigating the issue; the workaround is to enable jumbo frames on the VMs.
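For the Windows guests, the workaround boils down to raising the jumbo packet size on the vmxnet3 adapter. Below is a sketch; the adapter name is a placeholder, and the network path must support the larger MTU end to end.
# Sketch for a Windows guest with a vmxnet3 adapter; 'Ethernet0' is a placeholder name.
Set-NetAdapterAdvancedProperty -Name 'Ethernet0' -RegistryKeyword '*JumboPacket' -RegistryValue 9014
# Verify the setting took effect.
Get-NetAdapterAdvancedProperty -Name 'Ethernet0' -RegistryKeyword '*JumboPacket'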
Below is the issue we faced after upgrading SRM to 8.3.1:
IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1.
IP customization previously worked on these RHEL versions with SRM 6.5.
IP customization works with RHEL 7 VMs, which can utilize SAML tokens for authentication.
It looks like changes that happened between SRM 6.5 and later versions caused the conflict with LDAP on your RHEL 6 machines. Prior to the changes, SRM performed the script transfer using the VIX protocol, which has little to no authentication. This master access method worked from vCenter, where SRM would transfer the script through vCenter, then directly to the ESXi host and eventually the VM, without any authentication or tokens involved.
For security reasons, this is obviously a weakness. This has changed and is now enforced: instead, we use SAML token authentication through an SSO solution user that is created when SRM registers with the PSC/SSO and vCenter. This new method also meant we needed to upgrade how Tools operates and allow it to be a part of that process with SSO, thus the vgAuth part of Tools.
This process now impersonates the root account to execute scripts inside the guest OS that are directly tied to an authentication token through SSO.
Also, as you see above, SRM only contacts SSO to get authentication; outside of that, SRM itself now transfers the script to the ESXi host and then the VM, instead of vCenter doing it. This new process forces us to authenticate and use the benefits of the temporary SAML token for activities like this. This is also the exact same process if you run custom scripts inside the guest OS in your plans.
We have seen cases where LDAP, and now in your case OpenLDAP, causes a conflict with our ability to impersonate on the guest OS. Unfortunately, like any other third-party application or solution that conflicts with our operation, this needs to be addressed from the offending application itself. In this case, it appears SSSD works, as proven by your tests.
Recently we moved to AMD EPYC 7713 64-core processors on Dell R6525 and noticed ESXi hosts showing 100% CPU, fluctuating intermittently. When we checked performance in esxtop it was very low, and in our other environment, a Supermicro AS-2114GT-DNR with the same AMD EPYC 7713P, we noticed a similar CPU spike.
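To compare what the vCenter charts report against what esxtop shows, the realtime CPU counter can be sampled as sketched below; the hostname is a placeholder, and an existing PowerCLI session is assumed.
# Sketch: sample the realtime host CPU usage counter behind the vCenter performance charts.
Get-Stat -Entity (Get-VMHost 'esx01.example.com') -Stat 'cpu.usage.average' -Realtime -MaxSamples 12 |
    Select-Object Timestamp, Value, Unit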