Network issues on VMs running on ESXi on the new Mac mini.

We run ESXi 6.7 U3 on both a new Mac mini (2019/2020) and an older model (2018). On the new Mac mini, VMs (macOS, Windows, and Linux) have trouble connecting to the network and downloading files. The only hardware difference we found is the network card, which is a different model.

Enabling jumbo frames on the VMs made the problem go away, and downloads started working. We could not find the exact cause, though, because neither the hypervisor itself nor macOS running directly on the hardware shows the issue.

We are still investigating. For now, the workaround is to enable jumbo frames on the VMs.
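
To confirm that jumbo frames really pass end to end from a Linux guest, a minimal sketch follows. It assumes an MTU of 9000; the interface name `eth0` and the target address are placeholders, not values from our environment. The largest ICMP payload that fits in one packet is the MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header).

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU - 20 (IPv4 header without options) - 8 (ICMP header).
ping_payload_for_mtu() {
    mtu="$1"
    echo $((mtu - 28))
}

# Example usage on a Linux guest (interface and target are placeholders):
#   ip link set dev eth0 mtu 9000
#   ping -M do -c 3 -s "$(ping_payload_for_mtu 9000)" 192.0.2.10
# "-M do" sets Don't Fragment, so the ping fails if any hop has a smaller MTU.
```

If the large ping fails while a normal ping works, some hop between the guest and the target is not passing jumbo frames.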

Posted in MacMini, VMware

Ports required for the AD

Many articles cover the ports required for an AD connection. In my environment the ports below are enabled, and we are able to join clients to AD with DNS registered.




Posted in AWS, Azure, Cloud

IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1

Below is the issue we faced after upgrading SRM to 8.3.1:

IP customization is failing on RHEL 5 and 6 VMs with SRM 8.3.1.

IP customization previously worked on these RHEL versions with SRM 6.5.

IP customization works with RHEL 7 VMs, which can use SAML tokens for authentication.

It looks like changes made between SRM 6.5 and later versions caused the conflict with LDAP on your RHEL 6 machines. Before the change, SRM performed script transfer using the VIX protocol, which has little to no authentication. This master access method worked through vCenter: SRM would transfer the script to vCenter, then directly to the ESXi host, and eventually to the VM, without any authentication or tokens involved.

For security reasons, this is obviously a weakness. This has changed and is now enforced: instead, we use SAML token authentication through an SSO Solution User that is created when SRM registers with the PSC/SSO and vCenter. This new method also meant we needed to upgrade how Tools operates so that it can be a part of that process with SSO, hence the vgAuth part of Tools.

This process now impersonates the root account to execute scripts inside the guest OS that are directly tied to an authentication token through SSO.

Also, SRM only contacts SSO to get authentication. Beyond that, SRM itself now transfers the script to the ESXi host and then to the VM, instead of vCenter doing it. This new process forces us to authenticate and use the benefits of the temporary SAML token for activities like this. The exact same process applies if you run custom scripts inside the guest OS in your plans.

We have seen cases where LDAP, and now in your case OpenLDAP, conflicts with our ability to impersonate on the guest OS. Unfortunately, as with any third-party application or solution that conflicts with our operation, this needs to be addressed in the offending application itself. In this case, it appears SSSD works, as proven by your tests.

Posted in SRM, VMware

Bug in vCenter with hosts running AMD EPYC Zen3 (Milan) and AMD EPYC 7713.

We recently moved to AMD EPYC 7713 (64-core) CPUs in Dell R6525 servers and noticed the ESXi hosts showing 100% CPU, with the value fluctuating intermittently. When we checked in esxtop, the actual CPU usage was very low. In our other environment, Supermicro AS -2114GT-DNR servers with the AMD EPYC 7713P showed a similar CPU spike.

As per the KB https://kb.vmware.com/s/article/85071, this appears to be a cosmetic issue that can be safely ignored. Alternatively, the same KB describes a workaround.

Posted in VMware

Guest OS Hung for GPU Enabled VMs

We have NVIDIA Tesla V100-PCIE-32GB GPUs running in SYS-1029GQ-TRT servers, and for the past few months VMs have been hanging: the guest goes to a black screen and the only option is to reboot the VM.

Below are the details of the machine

Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)

Version or Release of Remoting Solution =

Endpoint Client Information = Windows 10 x64 20h2

Number of displays / Display resolution = Single Display. (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin

NVIDIA analyzed the logs and noticed the error NVOS status 0x19 repeated multiple times with VGPU message 21 and VGPU message 52. This is a known issue, listed as "5.11. 11.0 Only: Failure to allocate resources causes VM failures or crashes", and it was resolved in vGPU Software 11.1. Some Xid 43 and Timeout Detection and Recovery (TDR) errors also appear in the logs, but they can be a side effect of this main issue. NVIDIA's recommendation was to install the latest vGPU Software 11.5 on both the ESXi host and the guest VMs. The drivers are available at https://ui.licensing.nvidia.com/software. After we updated the driver, the issue was fixed.

Posted in Dell, HP, NVIDIA GPU, VMware

Root cause for the VPXD Crash issue

In my previous post I described the vCenter VPXD crash issue. After a lot of research, VMware support confirmed that the issue is caused by a Likewise failure that is fixed in 6.7 U3M. Since VMware patches are cumulative, we can upgrade to the latest patch, which includes all previous fixes.

Here’s the link to the fix documented in the 6.7 U3M release notes: https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-67u3m-release-notes.html#:~:text=If%20the%20identity,in%20this%20release.

Posted in VCSA6.7, VMware

VC6.7 U3 VPXD crash issue.

Every week our vCenter (VC) went down. When we logged in over SSH, we noticed the VPXD service was stopped, and attempts to start it failed. The vMon log showed a permission error on cloudvm-ram-size.log: the file is supposed to be owned by “root cis” but had changed to “root root”. Once we changed it back to “root cis”, the VPXD service started and the VC came online. All our other VCs have “root cis”, but the VMware engineer confirmed that in their lab some VCs have “root root” and others “root cis”.
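
A quick way to check the owner and group of the file from the shell can be sketched as below. The log path in the usage comment is an assumption (adjust it to wherever cloudvm-ram-size.log lives on your appliance), and the helper names are made up for this example.

```shell
#!/bin/sh
# Print "owner:group" for a file (GNU stat, as shipped on Linux).
owner_of() {
    stat -c '%U:%G' "$1"
}

# Return 0 if the file has the expected owner:group, non-zero otherwise.
has_owner() {
    [ "$(owner_of "$1")" = "$2" ]
}

# Example usage on the appliance (path is an assumption):
#   f=/var/log/vmware/cloudvm/cloudvm-ram-size.log
#   has_owner "$f" root:cis || chown root:cis "$f"
```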

The same symptoms kept happening every week. We could not find what was changing the permission, and involving VMware support did not help. At one point we moved on from the permission issue and started looking for other causes, and a senior VMware engineer got involved, but we still could not find any concrete root cause for the VPXD crash.

After a few weeks our VC suddenly became very slow: after we entered credentials, the login kept searching and never completed. At one point VPXD stopped, and this time we stopped all services and started them again without changing the file permission of cloudvm-ram-size.log. All services, including VPXD, started without any issue, so our focus shifted from the file permission to the VPXD crash itself.

After we uploaded the logs, VMware identified some repeated offline storage IOFilterProviders on multiple vCenters, which they asked us to clean up just to eliminate unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the session data they also identified a large number of requests hitting the VC from one particular IP, which turned out to be our dashboard reporting server, so we stopped that service to rule it out.

The VC ran fine after we stopped the reporting server from hitting it, but about a week later it became very slow again. This time all 4 other VCs connected to the same vsphere.local domain also slowed down. They took almost 15 minutes to recover and return to normal, while the problematic VC went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.

While engineering completes its review of the permission changes, we can most certainly tidy up the SSO environment. vCenter log bundles indicate the presence of external PSCs. We should clean up any and all residual references to these external PSCs, as all 5 vCenters should be running on an embedded PSC. Even though we retired the external PSCs a long time ago, it looks like the old entries were still present on all the VCs.

VMware recommended taking either powered-off snapshots or, if we could not take that downtime, stopping the vmdird service on all 5 vCenters, taking the 5 snapshots, and then starting the vmdird service again on all 5 vCenters.

As per VMware's advice we cleaned up the old PSC entries. The VC again ran without any issue for one week but then ended up with the same problem. This time VMware recommended enabling the option below to capture a VPXD dump for more details.

  • Backup the vmon config:

# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak

  • Edit the vmon config:

# vi /etc/vmware/vmware-vmon/config.json

  • Change the following line to true,

"DumpLiveCoreOnApiHealthFail" : false,

  • Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):

# service-control --restart vmware-vmon
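
The backup and edit steps above can be sketched as one small helper. This is only an illustration: the function name is made up, and it assumes the key appears in the file with straight double quotes and the literal value false. Restarting vmon is still a separate, manual step.

```shell
#!/bin/sh
# Back up a vMon config file and flip DumpLiveCoreOnApiHealthFail to true.
# Usage: enable_live_core_dump /etc/vmware/vmware-vmon/config.json
enable_live_core_dump() {
    cfg="$1"
    # Keep a copy of the original next to the file.
    cp "$cfg" "$cfg.bak" || return 1
    # Rewrite only the value of this one key; all other lines pass through unchanged.
    sed 's/\("DumpLiveCoreOnApiHealthFail"[[:space:]]*:[[:space:]]*\)false/\1true/' \
        "$cfg.bak" > "$cfg"
}
```

Writing from the backup into the original path keeps the original file's permissions, and the .bak copy makes it easy to revert once the dump has been captured.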

The same story repeated: vCenter went down again. Based on the VPXD crash dump, VMware asked us to measure the latency between the vCenter server and AD (https://kb.vmware.com/s/article/79317) if the issue happened again, to check whether the slowness was caused by the domain controller. Things then got worse: VPXD started crashing every two days, and the investigation continued without any fix.

Our vCenter was running an old version because we have a cloud stack that acts as the front end for our internal cloud, and it officially supports only up to vCenter Appliance 6.7 Update 3g, so we did not want to upgrade our VC without vendor approval. Since the situation had become very bad, and we had noticed a few VPXD-related fixes in vCenter Appliance 6.7 Update 3m, we decided to upgrade the VC to 3m.

In the same week VMware released a new security-related version, so we decided to go with the latest version and upgraded all the VCs. Since the upgrade the VC has been running fine for more than 3 weeks without any issue, which is our longest recent vCenter uptime. We are still not sure why VPXD was crashing or what caused the permission change on cloudvm-ram-size.log, but the upgrade finally made the environment stable and put us back on track.

Posted in Vcenter Appliance, vCSA 6.0, VCSA6.5, VCSA6.7, VMware, VPXD

Tips to check VCSA sessions.

The grep for NumSessions below shows the currently active sessions in the vpxd profiler log. The longer query on HttpSessionObject lists all HTTP sessions that were created over a certain period of time.

An HttpSessionObject entry marks a unique session at a point in time in the log. We filter on it because the profiler log contains many other objects that do not relate to HTTP sessions.

Log in over SSH and change to the vpxd log directory: cd /var/log/vmware/vpxd/

grep NumSessions vpxd-profiler*.log | less

grep ClientIP vpxd-profiler*.log | grep HttpSessionObject | grep -v com.vmware | grep -v "''" | cut -d "'" -f 3-6 | sort | uniq --count | sort -nr | less
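
The long query can also be wrapped in a small function with each stage commented, as a sketch. The function name is made up, and the comment on the cut stage assumes the profiler lines carry Id='...'/Username='...'/ClientIP='...' in that order; check a few raw lines in your own log before relying on it.

```shell
#!/bin/sh
# Count HTTP sessions per user and client IP from vpxd profiler logs in a directory.
# Usage: count_http_sessions /var/log/vmware/vpxd
count_http_sessions() {
    dir="$1"
    # -h drops the file name prefix so only the log line itself is processed.
    grep -h ClientIP "$dir"/vpxd-profiler*.log |
        grep HttpSessionObject |   # keep HTTP session entries only
        grep -v com.vmware |       # drop lines that mention com.vmware
        grep -v "''" |             # drop lines with an empty quoted value
        cut -d "'" -f 3-6 |        # assumed layout: keeps Username and ClientIP
        sort | uniq -c | sort -nr  # count duplicates, most frequent first
}
```

The first lines of the output are the user and IP combinations with the most sessions, which is how a single noisy client stands out.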

Reference : https://www.youtube.com/watch?v=eFM_ewwy2ys&ab_channel=VMworld

Posted in vCSA 6.0, VCSA6.7, VMware

Issue on SSH login to the ESXi 6.7 host with the AD user account

We were not able to log in to ESXi over SSH using an AD account, and attempts to leave the domain or join it again failed.

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
- While joining the host to the domain, we got the error below:

- Deleted the stale ESXi computer account entry from Active Directory.

- After deleting the account, ESXi was able to leave the domain successfully. Used the command below to leave the domain:
[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli leave
- Used the command below to join the ESXi host back to the domain, which was successful:
[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
Joining to AD Domain:   prd.com
With Computer DNS Name: esx.prd.com
- After joining ESXi to the domain, the team was able to log in to the ESXi host using a domain user account.

Posted in ESXi issue, VMware

Tip to check ESXi/vCenter errors using Splunk.

Recently we had an “All Paths Down” issue on one of our hosts, and I wanted to find out how many events there were and how long the issue lasted on the host. I found the steps below in Splunk, where we can highlight a keyword to build the list. We can easily get these details from ESXi itself, but I felt the steps would be useful for other use cases.

Make sure the Add-on for VMware (https://splunkbase.splunk.com/app/3215/) is installed in Splunk. It is free and installs the VMware sourcetype parsers.

1. Click on Event Action > Extract Fields to start the wizard.

2. Select Regular Expression > highlight to select a value > name the field > continue on to validation and complete the wizard.

When you click the events, Splunk shows all the events related to the word you highlighted.



Posted in logs, vCSA 6.0, VCSA6.5, VCSA6.7, VMware