Guest OS Hangs on GPU-Enabled VMs

We are running NVIDIA Tesla V100-PCIE-32GB on a SYS-1029GQ-TRT, and for the past few months we have been facing an issue where VMs hang: the guest goes to a black screen and the only option is to reboot the VM.

Below are the details of the machine:

Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)
Version or Release of Remoting Solution = 7.4.0.13800
Endpoint Client Information = Windows 10 x64 20H2
Number of displays / Display resolution = Single display (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin

NVIDIA analyzed the logs and found the error NVOS status 0x19 repeated multiple times, together with VGPU message 21 and VGPU message 52. This is known issue 5.11, "11.0 Only: Failure to allocate resources causes VM failures or crashes", which was resolved in vGPU Software 11.1. Some Xid 43 and Timeout Detection and Recovery (TDR) errors also appear in the logs, but they may be a side effect of this main issue.

The recommendation was to install the latest vGPU Software, 11.5, on both the ESXi host and the guest VMs. The drivers are available at https://ui.licensing.nvidia.com/software. After we updated the driver, the issue was fixed.
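If you want to check your own logs for the same signatures before opening a case, a rough Python sketch like the one below may help. It only counts lines containing the strings quoted above (NVOS status 0x19, VGPU message 21, VGPU message 52, Xid 43). The exact layout of the log lines is an assumption on my part, so the patterns are deliberately loose, and the names `PATTERNS`, `count_signatures` and `scan_file` are hypothetical helpers, not part of any NVIDIA or VMware tool.

```python
import re
from collections import Counter

# Patterns built from the strings quoted in this post. The real log line
# layout is an assumption, so matching is loose and case-insensitive.
PATTERNS = {
    "nvos_status_0x19": re.compile(r"NVOS status 0x19\b", re.IGNORECASE),
    "vgpu_message_21": re.compile(r"VGPU message 21\b", re.IGNORECASE),
    "vgpu_message_52": re.compile(r"VGPU message 52\b", re.IGNORECASE),
    # Accepts "Xid 43" as well as "Xid (<device>): 43" (assumed variants).
    "xid_43": re.compile(r"\bXid(?:\s*\([^)]*\):)?\s*43\b", re.IGNORECASE),
}


def count_signatures(lines):
    """Count how many lines contain each error signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Calling `scan_file()` on a copy of the VM's log file returns a counter per signature. A high count for NVOS status 0x19 alongside the two VGPU messages would be consistent with what NVIDIA saw in our case, but it is not a diagnosis on its own.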
