Guest OS Hung for GPU Enabled VMs

We run NVIDIA Tesla V100-PCIE-32GB GPUs in SYS-1029GQ-TRT servers, and for the past few months VMs have been hanging: the guest goes to a black screen and the only option is to reboot the VM.

Below are the details of the environment:

Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)
Version or Release of Remoting Solution = 7.4.0.13800
Endpoint Client Information = Windows 10 x64 20H2
Number of displays / Display resolution = Single display (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin

NVIDIA analyzed the logs and noticed that the error NVOS status 0x19 is repeated multiple times together with VGPU message 21 and VGPU message 52. This is a known issue, "5.11. 11.0 Only: Failure to allocate resources causes VM failures or crashes", which was resolved in vGPU Software 11.1. Some Xid 43 and Timeout Detection and Recovery (TDR) errors were also seen in the logs, but they can be a side effect of the main issue. NVIDIA recommended installing the latest vGPU Software 11.5 on both the ESXi host and the guest VMs. The drivers are available at https://ui.licensing.nvidia.com/software. After we updated the driver, the issue was fixed.

Posted in Dell, HP, NVIDIA GPU, VMware

Root cause for the VPXD Crash issue

In the previous post I explained the vCenter VPXD crash issue. After a lot of research, VMware support confirmed that the issue is caused by a Likewise failure that is fixed in 6.7 U3m. Since VMware patches are cumulative, we can upgrade to the latest patch, which includes all the previous fixes.

Here’s the link to the fix documented in the 6.7 U3M release notes: https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-67u3m-release-notes.html#:~:text=If%20the%20identity,in%20this%20release.

Posted in VCSA6.7, VMware

VC6.7 U3 VPXD crash issue.

Every week our VC went down. When we logged in over SSH, we noticed the VPXD service was stopped, and when we tried to start it, it failed. The vMon log showed a permission error on cloudvm-ram-size.log, whose owner and group are supposed to be "root cis" but had changed to "root root". Once we changed the ownership back to "root cis", the VPXD service started and the VC came online. All our other VCs have "root cis", but a VMware engineer confirmed that in their lab some VCs have "root root" and others have "root cis".
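The ownership check described above can be sketched as a small shell helper. This is a minimal sketch, assuming GNU stat is available on the appliance. The function names are hypothetical, and the log path in the usage comment is an assumption that should be confirmed on your own VCSA before changing anything.

```shell
# Print "owner:group" for a file (GNU stat syntax).
owner_of() {
    stat -c '%U:%G' "$1"
}

# Return 0 if the file has the expected owner:group, non-zero otherwise.
has_owner() {
    [ "$(owner_of "$1")" = "$2" ]
}

# Usage on the appliance (the path below is an assumption, verify it first):
#   f=/var/log/vmware/cloudvm/cloudvm-ram-size.log
#   has_owner "$f" root:cis || chown root:cis "$f"
#   service-control --start vmware-vpxd
```

Keeping the check separate from the chown makes it easy to log how often the ownership actually drifts before correcting it.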

The same symptoms recurred every week. We could not find what was changing the ownership, and involving VMware support did not help either. At one point we set the permission issue aside and started looking at other causes with a senior VMware engineer, but we could not find any concrete root cause for the VPXD crash.

After a few weeks we noticed our VC suddenly became very slow: after we entered credentials, the login page kept searching and never completed the login, and at one point VPXD stopped. This time we stopped all the services and started them again without changing the ownership of cloudvm-ram-size.log. All the services, including VPXD, started without any issue, so our focus moved from the file permission to the VPXD crash itself.

After we uploaded the logs, VMware identified some repeated offline storage IOFilterProviders on multiple vCenters and asked us to clean them up, just to eliminate unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the session data they also identified a large number of requests hitting the VC from one particular IP, which turned out to be our dashboard reporting server, so we stopped that service to rule it out.

The VC ran fine after we stopped the reporting server from hitting it, but after about a week it became very slow again. This time the other 4 VCs connected to the same SSO domain, vsphere.local, also slowed down. They took almost 15 minutes to recover, while the problematic VC went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.

VMware's recommendation was that, while engineering completed its review of the permission changes, we could tidy up the SSO environment. The vCenter log bundles indicated the presence of external PSCs, and any residual references to them should be cleaned up, since all 5 vCenters should be running with an embedded PSC. Even though we retired the external PSCs a long time ago, the old entries were still present on all the VCs.

VMware recommended taking either powered-off snapshots or, if we could not take that downtime, stopping the vmdird service on all 5 vCenters, taking the 5 snapshots, and then starting the vmdird service again on all 5 vCenters.

As VMware advised, we cleaned up the old PSC entries, and the VC again ran without any issue for one week before hitting the same problem. This time VMware recommended enabling the option below to capture a VPXD dump for more detail.

  • Backup the vmon config:

# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak

  • Edit the vmon config:

# vi /etc/vmware/vmware-vmon/config.json

  • Change the following line from false to true:

"DumpLiveCoreOnApiHealthFail" : false,

  • Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):

# service-control --restart vmware-vmon
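The backup and edit steps above could also be scripted instead of done by hand in vi. The following is a minimal sketch, assuming GNU sed and that the flag appears in the file exactly as quoted above. The function name is hypothetical and is not part of any VMware tooling.

```shell
# Back up a vmon config file, then flip DumpLiveCoreOnApiHealthFail
# from false to true. Other keys in the file are left untouched.
enable_live_core_dump() {
    cp "$1" "$1.bak" || return 1
    sed -i 's/\("DumpLiveCoreOnApiHealthFail"[[:space:]]*:[[:space:]]*\)false/\1true/' "$1"
}

# Usage on the appliance (schedule downtime before the restart):
#   enable_live_core_dump /etc/vmware/vmware-vmon/config.json
#   service-control --restart vmware-vmon
```

The substitution is anchored on the key name, so running it a second time is harmless once the value is already true.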

Same story: the vCenter went down again. Based on the VPXD crash dump, VMware asked us to measure the latency between the vCenter Server and AD (https://kb.vmware.com/s/article/79317) the next time the issue occurred, to check whether the slowness was caused by the domain controller. But things got worse: VPXD started crashing every two days, and the investigation continued without a fix.

Our vCenter was running the old 6.7.0.44000 version because our internal cloud uses a cloud stack front end that officially supports only up to vCenter Appliance 6.7 Update 3g (6.7.0.44000), so we did not want to upgrade the VC without vendor approval. Since the situation had become very bad, and we had noticed a few VPXD-related fixes in vCenter Appliance 6.7 Update 3m, we decided to upgrade the VC to 3m (6.7.0.47000).

In the same week VMware released a new security-related version, 6.7.0.48000, so we decided to go with the latest version and upgraded all the VCs. Since the upgrade, the VC has been running fine for more than 3 weeks without any issue, which is our highest recent vCenter uptime. We are still not sure why VPXD was crashing or what caused the cloudvm-ram-size.log permission change, but the upgrade made the environment stable and put us back on track.

Posted in Vcenter Appliance, vCSA 6.0, VCSA6.5, VCSA6.7, VMware, VPXD

Tips to check VCSA sessions.

The grep for NumSessions below shows the currently active sessions recorded in the vpxd profiler log. The longer query, which filters on HttpSessionObject, counts all HTTP sessions that were created over a period of time.

The HttpSessionObject marks a unique session at a point in time in the log. We filter on it because the profiler log contains many other objects that don't relate to HTTP sessions.

Log in over SSH and change to the log directory: cd /var/log/vmware/vpxd/

grep NumSessions vpxd-profiler*.log | less

grep ClientIP vpxd-profiler*.log | grep HttpSessionObject | grep -v com.vmware | grep -v "''" | cut -d "'" -f 3-6 | sort | uniq --count | sort -nr | less

Reference : https://www.youtube.com/watch?v=eFM_ewwy2ys&ab_channel=VMworld

Posted in vCSA 6.0, VCSA6.7, VMware

Issue on SSH login to the ESXi 6.7 host with the AD user account

We were not able to log in to the ESXi host over SSH using an AD account, and both leaving the domain and re-adding the host to the domain failed.

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
- While adding the host we got the error below:

Error: LW_ERROR_LDAP_CONSTRAINT_VIOLATION

- Deleted the stale ESXi computer account from Active Directory.
- After the account was deleted, the ESXi host was able to leave the domain with the command below:

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli leave
- Used the command below to add the ESXi host back to the domain, which was successful:

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
Joining to AD Domain:   prd.com
With Computer DNS Name: esx.prd.com
 SUCCESS
- After the host rejoined the domain, the team was able to log in to the ESXi host using a domain user account.

Posted in ESXi issue, VMware

Tip to check ESXi/vCenter errors using Splunk.

Recently we had an "All Paths Down" issue on one of our hosts, and I wanted to find out how many events there were and how long the issue lasted on the host. I worked out the steps below in Splunk, where we can highlight a keyword to list the matching events. We could easily get these details from the ESXi host itself, but I felt the steps would be useful for other use cases.

Make sure the Add-on for VMware (https://splunkbase.splunk.com/app/3215/) is installed in Splunk. It is available at no cost and installs the VMware sourcetype parsers.

1. Click on Event Action > Extract Fields to start the wizard

2. Select Regular Expression > highlight to select a value > name the field > continue on to validation and complete the wizard.

When you click the events, it shows all the events that match the word you highlighted.

Useful Links:

https://splunkbase.splunk.com/app/3975/

Posted in logs, vCSA 6.0, VCSA6.5, VCSA6.7, VMware

Bug noticed on VCSA 14367737 Syslog configuration.

We are running VCSA build 14367737 and it can't be upgraded because our internal cloud stack sits on top of the vCenter and supports only VCSA build 14367737. I tried forwarding the VCSA logs to the syslog server (Splunk) and noticed that, after the configuration, forwarding worked for a few hours and then stopped. We had to restart the service manually with systemctl restart rsyslog to forward the logs to the Splunk server again.

After trying a few options, we upgraded the VC to different versions in our test environment and noticed the issue was fixed in vCenter Appliance 6.7 Update 3g (6.7.0.44000), build 16046470. Even though the release notes don't mention this issue, it looks like the rsyslog version was upgraded in this VCSA version.

As a workaround, we can configure a cron job that restarts the service every two hours.
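A cron entry for this workaround might look like the sketch below. It assumes the appliance reads cron.d-style files (which include a user field) and that systemctl lives at /usr/bin/systemctl. Both are assumptions to verify on your VCSA, and the function and file names are hypothetical.

```shell
# Write a cron.d-style entry that restarts rsyslog at minute 0 of every
# second hour. $1 is the file to create.
write_rsyslog_cron() {
    printf '%s\n' '0 */2 * * * root /usr/bin/systemctl restart rsyslog' > "$1"
}

# Usage on the appliance (file name is hypothetical):
#   write_rsyslog_cron /etc/cron.d/restart-rsyslog
```

Since this only masks the forwarding failure, the entry is worth removing once the appliance can be upgraded to a build where the issue no longer occurs.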

Posted in VCSA6.7, VMware

Packet drop issue on HP Gen 9 \ Gen 10 servers running ESXi6.7.

We noticed packet drops on all of our HP BL460c Gen 9 / Gen 10 blades across the region running ESXi 6.7.0, build 16316930. The network adapter installed in these servers is the HPE FlexFabric 10Gb 2-port 536FLB Adapter, which is a QLogic adapter.

The HP custom image ships with the qfle3 driver version 1.1.6.0-1OEM.650.0.0.4598673. We tried updating the driver and the firmware of the HP enclosure, OA and Virtual Connect to the versions below, but that didn't fix the issue.

OA Firmware : 4.96

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_8e583ffa28874a53aa272b959b

3. Upgrade the Virtual Connect firmware on one switch and then on the other switch.

HP Virtual Connect Firmware: 4.85

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_f99f0bc5bfc4414aac021f81af#tab3

Solution:

After we had tried a lot of options, HP recommended installing the driver version below, and that fixed the packet drop issue.

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_fca9a16a601345919247b0c240#tab-history

[root@esx106:~] esxcli software vib install -v "/cp039955/QLogic-Network-iSCSI-FCoE-v2.0.102-14793946/QLogic-Network-iSCSI-FCoE-v2.0.102-offline_bundle-14793946/vib20/qfle3/QLC_bootbank_qfle3_1.0.87.0-1OEM.670.0.0.8169922.vib"

Installation Result

   Message: Host is not changed. Reboot is pending from previous transaction.

   Reboot Required: true

   VIBs Installed:

   VIBs Removed:

   VIBs Skipped: QLC_bootbank_qfle3_1.0.87.0-1OEM.670.0.0.8169922

The ESXi patch advisory lists the qfle3 driver as 1.0.50.11-9vmw.670.0.0.8169922, so I think we need a driver version close to that one. Installing the driver recommended by HP, 1.0.87.0-1OEM.670.0.0.8169922, fixed the issue for us.

Posted in ESXi issue, ESXi Patches, HP, VMware

Memories of 2020

We started migrating our Tier-1 infrastructure to AWS, and most of our internal applications have been moved to the cloud. Along the way I learned a lot of new AWS services and new technologies.

Around September, I was asked to support our internal cloud team, which runs a cloud stack on VMware, so after a very long gap I am back to VMware and virtualization technology. The changeover was very difficult at first, but now I am very much back on track.

After moving to the internal cloud team, I got the opportunity to take on the Scrum Master role. I have started doing that and am planning to finish the certification.

Even though the pandemic brought a lot of challenges, I had a good 2020 in my professional life and am looking forward to the new year 2021.

Posted in Uncategorized

Easy way to uninstall the Trend Deep Security agent.

I was looking for an easy way to uninstall the Trend agent on Windows 10 and found the commands below useful.

Get-Package -Name "Trend Micro Deep Security Agent" | Uninstall-Package

Or

msiexec.exe /x <exact MSI package name>.msi /quiet

Reference:

https://success.trendmicro.com/solution/1055096-performing-silent-uninstallation-of-deep-security-agent-dsa-from-windows-machine

Posted in Trend Micro Deep Security, Trend Micro Deep Security 9.5 ( Deep Security Agent )