We have NVIDIA Tesla V100-PCIE-32GB GPUs installed in a SYS-1029GQ-TRT server. For the past few months, VMs have been hanging and going to a black screen, and the only option is to reboot the VM.
Below are the details of the machine:
Remoting Solution / Method of connecting to VM = HP RGS (HP Remote Graphics Receiver)
Version or Release of Remoting Solution = 184.108.40.20600
Endpoint Client Information = Windows 10 x64 20H2
Number of displays / Display resolution = Single display (2560×1600)
Type (Thin, Fat, Mobile Client) = Thin
NVIDIA analyzed the logs and found that the error NVOS status 0x19 is repeated multiple times, together with VGPU message 21 and VGPU message 52. This is a known issue, documented as "5.11. 11.0 Only: Failure to allocate resources causes VM failures or crashes", and it was resolved in vGPU Software 11.1. The logs also show some Xid 43 and Timeout Detection and Recovery (TDR) errors, but these may be a side effect of the main issue. NVIDIA's recommendation was to install the latest vGPU Software, 11.5, on both the ESXi host and the guest VMs. The drivers are available at https://ui.licensing.nvidia.com/software. After we updated the driver, the issue was fixed.
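
If you want to check whether your own logs show the same markers before opening a case, something like the following may help. It is only a rough sketch: the marker strings are the ones quoted above, but the exact log line layout is not given here, so the script just counts lines containing each marker. The function name `count_markers` and the log path in the usage note are hypothetical.

```python
import re
from collections import Counter

# Marker strings taken from the post above. The surrounding log line format
# is an assumption, so each pattern only looks for the marker text itself.
PATTERNS = {
    "nvos_status_0x19": re.compile(r"NVOS status 0x19\b", re.IGNORECASE),
    "vgpu_message_21": re.compile(r"VGPU message 21\b", re.IGNORECASE),
    "vgpu_message_52": re.compile(r"VGPU message 52\b", re.IGNORECASE),
    "xid_43": re.compile(r"Xid\s+43\b", re.IGNORECASE),
}


def count_markers(lines):
    """Count how many lines contain each marker (hypothetical helper)."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, open whichever log file you collected (for example `with open("vm.log", errors="replace") as f: print(count_markers(f))`, where `vm.log` is a placeholder name). Many repeated NVOS status 0x19 lines would be consistent with the known issue described above, but the counts alone are not a diagnosis.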