Every week our vCenter (VC) went down. When we logged in over SSH, we found the VPXD service stopped, and attempts to start it failed. The vMon log showed a permission error on cloudvm-ram-size.log: its ownership is supposed to be “root cis”, but it had changed to “root root”. Once we changed it back to “root cis”, the VPXD service started and the VC came online. All our other VCs have “root cis”, but a VMware engineer confirmed that in their lab some VCs have “root root” and others have “root cis”.
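The ownership check described above can be scripted so it is easy to repeat after each outage. This is only a minimal sketch: the function name check_owner is made up for illustration, and the log path is an assumption about where the file lives on the appliance, so verify it on your own VC first.

```shell
#!/bin/sh
# check_owner FILE EXPECTED
# Prints the actual "owner:group" of FILE and returns 0 only if it equals EXPECTED.
# (check_owner is a hypothetical helper, not a VMware tool.)
check_owner() {
    file=$1
    expected=$2
    # stat -c is the GNU coreutils form; returns 2 if the file cannot be read
    actual=$(stat -c '%U:%G' "$file") || return 2
    echo "$actual"
    [ "$actual" = "$expected" ]
}

# Assumed location of the log on the appliance; confirm before using.
LOG=/var/log/vmware/cloudvm/cloudvm-ram-size.log

# On the appliance, as root, the fix we applied by hand would look like:
#   check_owner "$LOG" root:cis || chown root:cis "$LOG"
```

The chown is left commented out on purpose, since in our case the ownership turned out not to be the real cause.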
The same symptoms recurred every week. We could not find what was causing the permission change, and involving VMware support did not help. At one point we moved on from the permission issue and started looking for other causes with a senior VMware engineer, but we could not find any concrete root cause for the VPXD crash.
A few weeks later the VC suddenly became very slow: after we entered the login credentials it kept searching and the login never completed. At some point VPXD stopped. This time we stopped all the services and started them again without changing the ownership of cloudvm-ram-size.log. All the services, including VPXD, started without any issue, so our focus shifted from the file permission to the VPXD crash itself.
After we uploaded the logs, VMware identified some repeated offline storage IOFilterProviders on multiple vCenters and asked us to clean them up, just to eliminate unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the session data they also identified a large number of requests hitting the VC from one particular IP, which turned out to be our dashboard reporting server. We stopped that service to rule it out.
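A rough way to spot one client hammering the VC is to count how often each IPv4 address appears in a log file. The sketch below is generic and is not the method VMware support used; the function name top_clients is made up, and which log file to feed it is left to you.

```shell
#!/bin/sh
# top_clients LOGFILE [N]
# Prints the N (default 10) most frequent IPv4-looking strings in LOGFILE,
# one per line as "<count> <address>", most frequent first.
# (top_clients is a hypothetical helper for illustration.)
top_clients() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${2:-10}"
}
```

The pattern is crude: it will also match the first four fields of dotted version strings, so treat the output as a hint and confirm the source of the traffic separately.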
The VC ran fine after we stopped the reporting server from hitting it, but about a week later it became very slow again. This time the other 4 VCs connected to the same vsphere.local domain also slowed down. They took almost 15 minutes to recover and return to normal, but the problematic VC went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.
While engineering is completing their review for the permission changes, we can most certainly tidy up the SSO environment. vCenter log bundles indicate the presence of external PSCs. We should be cleaning up any and all residual references to these external PSCs, as all 5 vCenters should be running on an embedded PSC. It turned out that even though we had retired the external PSCs a long time ago, the old entries were still present on all the VCs.
VMware recommended taking powered-off snapshots or, if we could not afford that downtime, stopping the vmdird service on all 5 vCenters, taking the 5 snapshots, and then starting the vmdird service again on all 5 vCenters.
Following VMware's guidance we cleaned up the old PSC entries. The VC again ran without any issue for one week, but then the same issue returned. This time VMware recommended enabling the option below to capture a VPXD dump for more detail.
- Back up the vmon config:
# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak
- Edit the vmon config:
# vi /etc/vmware/vmware-vmon/config.json
- Change the value on the following line from false to true:
"DumpLiveCoreOnApiHealthFail" : false,
- Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):
# service-control --restart vmware-vmon
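The backup and edit steps above can also be done non-interactively. The following is a minimal sketch, not a VMware-provided script: the function name enable_live_core_dump is made up, and it assumes the flag appears in the file exactly as "DumpLiveCoreOnApiHealthFail" followed by a colon and false. It deliberately does not restart vmon, so the downtime stays under your control.

```shell
#!/bin/sh
# enable_live_core_dump CONFIG
# Copies CONFIG to CONFIG.bak, then flips DumpLiveCoreOnApiHealthFail from false to true.
# Returns 0 only if the flag reads true afterwards.
# (enable_live_core_dump is a hypothetical helper for illustration.)
enable_live_core_dump() {
    cfg=$1
    # Keep a copy of the original, preserving mode and timestamps
    cp -p "$cfg" "$cfg.bak" || return 1
    # GNU sed: rewrite only the value of this one key, leave everything else alone
    sed -i -E 's/("DumpLiveCoreOnApiHealthFail"[[:space:]]*:[[:space:]]*)false/\1true/' "$cfg"
    # Verify the result instead of assuming the substitution matched
    grep -Eq '"DumpLiveCoreOnApiHealthFail"[[:space:]]*:[[:space:]]*true' "$cfg"
}

# On the appliance, as root:
#   enable_live_core_dump /etc/vmware/vmware-vmon/config.json
```

If the function returns non-zero, the flag was not found in the expected form and the file should be edited by hand as described in the steps.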
The same story repeated: the vCenter went down again. Based on the VPXD crash dump, VMware asked us to measure the latency between the vCenter Server and AD (https://kb.vmware.com/s/article/79317) if the issue happened again, to check whether the slowness was caused by the domain controller. Things then got worse: VPXD started crashing every two days, and the investigation continued without any fix.
Our vCenter was running an old version, 184.108.40.206000, because we have a cloud stack with a front end for our internal cloud that officially supports only up to vCenter Appliance 6.7 Update 3g (220.127.116.11000), and we did not want to upgrade our VC without vendor approval. But the situation had become very bad, and we noticed a few VPXD-related fixes in vCenter Appliance 6.7 Update 3m, so we decided to upgrade the VC to 3m (18.104.22.168000).
That same week VMware released a new security-related version, 22.214.171.124000, so we decided to go with the latest version and upgraded all the VCs. Since the upgrade the VC has been running fine for more than 3 weeks without any issue, which is our highest vCenter uptime in recent times. We are still not sure why VPXD was crashing or what caused the ownership change on cloudvm-ram-size.log, but in the end the upgrade made the environment stable and put us back on track.