Below is the issue we faced after upgrading SRM to 8.3.1:
IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1.
IP customization previously worked on these RHEL versions with SRM 6.5.
IP customization works with RHEL 7 VMs, which can use SAML tokens for authentication.
It looks like changes made between SRM 6.5 and later versions caused the conflict with LDAP on your RHEL 6 machines. Prior to the changes, SRM performed script transfer using the VIX protocol, which has little to no authentication. This master access method worked from vCenter: SRM would transfer the script through vCenter, then directly to the ESXi host and eventually the VM, without any authentication or tokens involved.
For security reasons, this is obviously a weakness. This has changed, and it is now enforced that we instead use SAML token authentication through an SSO Solution User, which is created when SRM registers with the PSC/SSO and vCenter. This new method also meant we needed to upgrade how Tools operates so that it can be a part of that process with SSO, hence the vgAuth part of the tools.
This process now impersonates the root account to execute scripts inside the GuestOS that are directly tied to an authentication token through SSO.
Also, as you see above, SRM only contacts SSO to get authentication; outside of that, SRM itself now transfers the script to the ESXi host and then the VM, instead of vCenter doing it. This new process forces us to authenticate and use the benefits of the temporary SAML token for activities like this. This is also the exact same process used when you run custom scripts inside the Guest OS in your plans.
We have seen cases where LDAP, and now in your case OpenLDAP, causes a conflict with our ability to impersonate on the Guest OS. Unfortunately, as with any other third-party application or solution that conflicts with our operation, this needs to be addressed from the offending application itself. In this case, it appears SSSD works, as proven by your tests.
Recently we moved to 64-core AMD EPYC 7713 CPUs on Dell R6525 servers and noticed the ESXi hosts showing 100% CPU usage, with the value fluctuating intermittently. When we checked performance in ESXTOP, the CPU usage was very low. In our other environment, SuperMicro AS -2114GT-DNR servers with the AMD EPYC 7713P showed a similar CPU spike.
As per KB https://kb.vmware.com/s/article/85071, this is a cosmetic issue and can safely be ignored; alternatively, the same KB describes a workaround.
We have NVIDIA Tesla V100-PCIE-32GB GPUs running in SYS-1029GQ-TRT servers, and for the past few months VMs have been hanging and going to a black screen, where the only option is to reboot the VM.
Below are the details of the machine:
Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)
Version or Release of Remoting Solution = 220.127.116.1100
Endpoint Client Information = Windows 10 x64 20h2
Number of displays / Display resolution = Single Display. (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin
NVIDIA analyzed the logs and noticed that the error NVOS status 0x19 is repeated multiple times with VGPU message 21 and VGPU message 52. This is a known issue (5.11. 11.0 Only: Failure to allocate resources causes VM failures or crashes) that was resolved in vGPU Software 11.1. Some Xid 43 and Timeout Detection and Recovery (TDR) errors are also seen in the logs, but they can be a side effect of this main issue. The recommendation was to install the latest vGPU Software 11.5 on both the ESXi host and the guest VMs; the drivers are available at https://ui.licensing.nvidia.com/software. After updating the driver, the issue was fixed.
In the previous post I explained the vCenter VPXD crash issue. After a lot of research, VMware support confirmed that the issue is caused by a Likewise failure, and that the fix was shipped in 6.7 U3M. Since VMware patches are cumulative, we can upgrade to the latest patch, which includes all the previous fixes.
Here’s the link to the fix documented in the 6.7 U3M release notes: https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-67u3m-release-notes.html#:~:text=If%20the%20identity,in%20this%20release.
Every week our VC went down. When we logged in over SSH, we noticed the VPXD service was stopped, and attempts to start it failed. The vMon log showed a permission error on cloudvm-ram-size.log: its ownership is supposed to be “root cis”, but it had changed to “root root”. Once we changed it back to “root cis”, the VPXD service could be started and the VC came online. All our other VCs have “root cis”, but the VMware engineer confirmed that in their lab some VCs have “root root” and others “root cis”.
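The ownership check described above can be sketched as a small shell helper. This is only an illustration: the function name check_owner is hypothetical, the expected root:cis value comes from the observation above, and the full path of cloudvm-ram-size.log is not given here, so it is passed in as an argument. It assumes GNU stat (for the -c option).

```shell
# check_owner FILE EXPECTED
# Succeeds when FILE is owned by EXPECTED (given in user:group form).
check_owner() {
    file="$1"
    expected="$2"
    # GNU stat: %U = owner name, %G = group name
    actual="$(stat -c '%U:%G' "$file")" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file is $actual"
    else
        echo "MISMATCH: $file is $actual, expected $expected"
        return 1
    fi
}

# Example (the path is a placeholder):
#   check_owner /path/to/cloudvm-ram-size.log root:cis
```

If the helper reports a mismatch, restoring the expected ownership would be a chown root:cis on that file, which is the change that let VPXD start in our case.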
The same symptoms kept recurring every week. We could not find what was causing the permission change, and involving VMware support did not help. At one point we moved on from the permission issue and started looking for other causes with a senior VMware engineer, but could not find any concrete root cause for the VPXD crash.
After a few weeks our VC suddenly became very slow: after entering the login credentials it kept searching and we could not log in, and at one point VPXD stopped. This time we stopped all the services and started them again without changing the file ownership of cloudvm-ram-size.log. All the services, including VPXD, started without any issue, so our focus shifted from the file permission to the VPXD crash.
After we uploaded the logs, VMware identified some repeated offline storage IOFilterProviders on multiple vCenters, which they asked us to clean up just to eliminate some unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the session data they also identified a large number of requests hitting the VC from one particular IP, which turned out to be our dashboard reporting server, so we stopped that service to rule it out.
The VC ran fine after we stopped the reporting server from hitting it, but after about a week it became very slow again. This time all 4 other VCs connected to the same vsphere.local domain also slowed down; they took almost 15 minutes to recover, while the problematic VC went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.
While engineering is completing their review of the permission changes, we can most certainly tidy up the SSO environment. vCenter log bundles indicate the presence of external PSCs. We should be cleaning up any and all residual references to these external PSCs, as all 5 vCenters should be running on an embedded PSC. It looks like, even though we retired the external PSCs a long time ago, the old entries are still present on all the VCs.
VMware recommended taking powered-off snapshots or, if we could not take that downtime, stopping the vmdird service on all 5 vCenters, taking the 5 snapshots, and then starting the vmdird service on all 5 vCenters again.
As per VMware, we cleaned up the old PSC entries, and the VC again ran without any issue for one week before ending up with the same problem. This time VMware recommended enabling the option below to capture a VPXD dump for more details.
- Backup the vmon config:
# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak
- Edit the vmon config:
# vi /etc/vmware/vmware-vmon/config.json
- Change the following line from false to true:
"DumpLiveCoreOnApiHealthFail" : false,
- Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):
# service-control --restart vmware-vmon
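The backup and edit steps above could also be scripted. The following is a minimal sketch, assuming the flag appears in config.json exactly in the form shown above (quoted key, colon, then false). The helper name enable_live_core_dump is hypothetical and not a VMware tool, and vmon still has to be restarted afterwards as described.

```shell
# enable_live_core_dump CONFIG
# Keeps the original as CONFIG.bak, then flips the flag from false to true.
enable_live_core_dump() {
    config="$1"
    # -i.bak edits the file in place and saves the original as CONFIG.bak
    sed -i.bak \
        's/\("DumpLiveCoreOnApiHealthFail"[[:space:]]*:[[:space:]]*\)false/\1true/' \
        "$config"
}

# Example:
#   enable_live_core_dump /etc/vmware/vmware-vmon/config.json
```

Only the named key is touched; other settings that happen to be false are left as they are.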
Same story again: the vCenter went down, and based on the VPXD crash dump, VMware asked us to measure the latency between the vCenter server and AD (https://kb.vmware.com/s/article/79317) if the issue happened again, to check whether the slowness was caused by the domain controller. But things got worse: VPXD started crashing every two days, and the investigation was still ongoing without any fix.
Our vCenter was running the old version 18.104.22.168000 because our internal cloud runs a cloud stack front end that officially supports only up to vCenter Appliance 6.7 Update 3g (22.214.171.124000), so we did not want to upgrade our VC without vendor approval. Since the situation had become very bad, and we had noticed a few VPXD-related fixes in vCenter Appliance 6.7 Update 3m, we decided to upgrade the VC to the 3m (126.96.36.199000) version.
In the same week VMware released a new security-related version, 188.8.131.52000, so we decided to go with the latest version and upgraded all the VCs. Since the upgrade, the VC has been running fine for more than 3 weeks without any issue, which is our highest recent vCenter uptime. We are still not sure why VPXD was crashing or what caused the ownership change on cloudvm-ram-size.log, but in the end the upgrade made the environment stable and put us back on track.
The grep for NumSessions below shows the currently active sessions in the vpxd profiler. The longer query on HttpSessionObject checks for all HTTP sessions that were created over a certain period of time.
The HttpSessionObject notes a unique session at a point in time in the log. We filter on this because there are many different objects noted in the profiler log that don’t relate to HTTP sessions.
Log in via SSH and run: cd /var/log/vmware/vpxd/
grep NumSessions vpxd-profiler*.log | less
grep ClientIP vpxd-profiler*.log | grep HttpSessionObject | grep -v com.vmware | grep -v '""' | cut -d "'" -f 3-6 | sort | uniq --count | sort -nr | less
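The tail of that pipeline (sort, uniq with a count, then a reverse numeric sort) is what turns the extracted fields into a per-client tally. Here is a minimal sketch of that idiom on its own, using a hypothetical helper name and made-up sample values rather than real profiler output:

```shell
# count_desc: read one value per line on stdin and print
# "<count> <value>" lines, most frequent value first.
count_desc() {
    # uniq -c is the short form of uniq --count; it needs sorted input
    sort | uniq -c | sort -nr
}

# Made-up sample input, not real vpxd-profiler data:
printf '%s\n' 192.0.2.5 192.0.2.9 192.0.2.5 192.0.2.5 | count_desc
```

The first line of the output is the value that occurs most often, which is how the chatty client stands out in the real log.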
Recently we had an “All Paths Down” issue on one of our hosts, and I wanted to find out how many events there were and how long the issue lasted on the host. I found the steps below in Splunk, where we can highlight a keyword to get the list. We can easily get these details from ESXi itself, but I felt the steps below would be useful for other use cases.
Make sure the Add-on for VMware (https://splunkbase.splunk.com/app/3215/) is installed in Splunk; it is available at no cost and installs the VMware sourcetype parsers.
1. Click on Event Action > Extract Fields to start the wizard
2. Select Regular Expression > highlight to select a value > name the field > continue on to validation and complete the wizard.
When you click the events, it will show all the events containing the word you highlighted.
We are running VCSA build 14367737, and it can’t be upgraded because we have an internal cloud stack on top of the vCenter that supports only VCSA build 14367737. I tried forwarding the VCSA logs to the syslog server (Splunk) and noticed that after the configuration it worked for a few hours and then stopped; we had to restart the service manually with systemctl restart rsyslog to forward the logs to the Splunk server again.
After trying a few options, we upgraded the VC to different versions in our test environment and noticed that the issue is fixed in vCenter Appliance 6.7 Update 3g (184.108.40.206000) 16046470. Even though the release notes don’t mention anything about this issue, it looks like the rsyslog version was upgraded in this VCSA version.
As a workaround, we can configure a cron job to restart the service every two hours.
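One way to set up that cron job is sketched below. It rests on a few assumptions: the cron daemon on the appliance reads /etc/cron.d, systemctl is on cron's PATH, and the file name restart-rsyslog and the helper name install_rsyslog_cron are both hypothetical. Check these against your own VCSA before relying on it.

```shell
# install_rsyslog_cron [CRON_DIR]
# Writes a cron.d entry that restarts rsyslog at minute 0 of every second hour.
install_rsyslog_cron() {
    cron_dir="${1:-/etc/cron.d}"
    # cron.d fields: minute hour day-of-month month day-of-week user command
    printf '%s\n' '0 */2 * * * root systemctl restart rsyslog' \
        > "$cron_dir/restart-rsyslog"
}

# Example (run as root on the appliance):
#   install_rsyslog_cron
```

The same schedule line, without the user field, could instead go into root's crontab via crontab -e.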