VC6.7 U3 VPXD crash issue.

Every week the vCenter went down. When we logged in over SSH, we noticed the VPXD service was stopped, and attempts to start it failed. The vMon log showed a permission error on cloudvm-ram-size.log: the ownership is supposed to be "root cis", but it had changed to "root root". Once we changed the ownership back to "root cis", the VPXD service started and the vCenter came back online. All our other vCenters have "root cis", although the VMware engineer confirmed that in their lab a few vCenters show "root root" and others "root cis".
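A rough sketch of the check and the fix we ran, assuming the file sits in its usual location under /var/log/vmware/cloudvm/ on the VCSA:

# ls -l /var/log/vmware/cloudvm/cloudvm-ram-size.log
# chown root:cis /var/log/vmware/cloudvm/cloudvm-ram-size.log
# service-control --start vmware-vpxd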

The same symptoms recurred every week and we could not find what was causing the permission change, so we involved VMware support, but with no luck. At one point we moved past the permission issue and started looking for other causes, involving a senior VMware engineer, but we could not find any concrete root cause for the VPXD crash.

After a few weeks we noticed the vCenter suddenly became very slow; after entering the login credentials it kept searching and we could not log in. At one point VPXD stopped, and this time we stopped all the services and started them again without changing the file permission of cloudvm-ram-size.log. All the services, including VPXD, started without any issue, so our focus shifted from the file permission to the VPXD crash itself.
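For reference, this is roughly what the full service restart on the VCSA looked like (standard service-control usage from an SSH session on the appliance):

# service-control --stop --all
# service-control --start --all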

After we uploaded the logs, VMware identified some repeated offline storage IOFilterProviders on multiple vCenters, which they asked us to clean up just to eliminate some unnecessary warnings (KB: https://kb.vmware.com/s/article/76633?lang=en_US). From the support session they also identified a lot of requests from one particular IP hitting the vCenter; it turned out to be our dashboard reporting server, and we stopped that service to eliminate it as a cause.

The vCenter ran fine after we stopped the reporting server from hitting it, but after about a week it became very slow again. This time we noticed that the other 4 vCenters connected to the same vsphere.local domain also went slow; they took almost 15 minutes to recover and return to normal, while the problematic vCenter went down with the same symptoms. The ticket was escalated, and the VMware engineering team got involved and started reviewing the logs.

While engineering completed their review of the permission changes, we could at least tidy up the SSO environment. The vCenter log bundles indicated the presence of external PSCs, and we needed to clean up any and all residual references to them, as all 5 vCenters should be running with an embedded PSC. It turned out that even though we had retired the external PSCs a long time back, the old entries were still present on all the vCenters.

VMware recommended taking powered-off snapshots or, if we could not take that downtime, stopping the vmdird service on all 5 vCenters, taking the 5 snapshots, and then starting the vmdird service again on all 5 vCenters.
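A sketch of the vmdird stop/start around the snapshots (the service name is listed as vmdird in service-control output on an embedded-PSC VCSA; verify with service-control --status first):

# service-control --stop vmdird
(take the snapshot while the directory service is stopped)
# service-control --start vmdird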

So, as per VMware, we cleaned up the old PSC entries, and again the vCenter ran without any issue for a week, but it still ended up with the same problem. This time VMware recommended enabling the option below to capture a VPXD dump for more detail.

  • Backup the vmon config:

# cp /etc/vmware/vmware-vmon/config.json /etc/vmware/vmware-vmon/config.json.bak

  • Edit the vmon config:

# vi /etc/vmware/vmware-vmon/config.json

  • Change the value of DumpLiveCoreOnApiHealthFail from false to true:

"DumpLiveCoreOnApiHealthFail" : true,

  • Restart the vmon service (this will take down the vpxd and vsphere-ui services, so scheduled downtime is recommended):

# service-control --restart vmware-vmon
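Once the option is enabled, a failed API health check should produce a live core for vpxd; on the VCSA these dumps normally land under /storage/core (worth confirming the exact path with support):

# ls -lh /storage/core/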

Again the same story: vCenter went down again. From the VPXD crash dump, VMware asked us to measure the latency between the vCenter server and AD (https://kb.vmware.com/s/article/79317) the next time the issue occurred, to check whether the slowness was caused by a domain controller. But things got worse: VPXD started crashing every two days, and the investigation continued without a fix.

Our vCenter was running the old 6.7.0.44000 build because our internal cloud stack front end officially supports only up to vCenter Appliance 6.7 Update 3g (6.7.0.44000), so we did not want to upgrade the VC without vendor approval. Since the situation had become very bad, and we also noticed a few VPXD-related fixes in vCenter Appliance 6.7 Update 3m, we decided to upgrade the VC to the 3m (6.7.0.47000) build.

That same week VMware released a new security-related build, 6.7.0.48000, so we decided to go with the latest version and upgraded all the vCenters. For more than 3 weeks since the upgrade the vCenter has been running fine without any issue, which is its highest recent uptime. We still do not know why VPXD was crashing or what caused the cloudvm-ram-size.log permission change, but the upgrade made the environment stable and got us back on track.


Tips to check VCSA sessions.

The grep for NumSessions below shows the currently active sessions from the vpxd profiler log. The longer query on HttpSessionObject checks all HTTP sessions that were created over a certain period of time.

An HttpSessionObject entry notes a unique session at a point in time in the log. We filter on this because the profiler log contains many other objects that do not relate to HTTP sessions.

Log in over SSH and change to the vpxd log directory: cd /var/log/vmware/vpxd/

grep NumSessions vpxd-profiler*.log | less

grep ClientIP vpxd-profiler*.log | grep HttpSessionObject | grep -v com.vmware | grep -v '"""' | cut -d "'" -f 3-6 | sort | uniq -c | sort -nr | less

Reference : https://www.youtube.com/watch?v=eFM_ewwy2ys&ab_channel=VMworld


Issue with SSH login to an ESXi 6.7 host using an AD user account

We were not able to log in to ESXi over SSH using an AD account, and when we tried to leave or re-join the domain, the operation kept failing.

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
 - While joining the host to the domain, we got the error below:
 
Error: LW_ERROR_LDAP_CONSTRAINT_VIOLATION 


Deleted the stale entry/ESXi computer account from Active Directory.

After deleting the account, the ESXi host was able to leave the domain successfully using the command below:
 
[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli leave
- We then used the command below to join the ESXi host back to the domain, which succeeded:
 
[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli join prd.com admin
Joining to AD Domain:   prd.com
With Computer DNS Name: esx.prd.com
 SUCCESS
- After joining the ESXi host back to the domain, the team was able to log in to the host using a domain user account.
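To double-check the join status afterwards, the same likewise tooling can report it (standard domainjoin-cli usage):

[root@esx:~] /usr/lib/vmware/likewise/bin/domainjoin-cli query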


Tip to check ESXi/vCenter errors using Splunk.

Recently we had an "All Paths Down" issue on one of our hosts, and I wanted to find out how many events there were and how long the issue had been present on the host. I identified the steps below in Splunk, where we can highlight a keyword to build the list. We can easily get the same details from the ESXi host itself, but I felt these steps would be useful for other use cases as well.
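For reference, the same APD events can also be pulled straight from the host (a quick sketch; vmkernel.log is where APD messages are logged on ESXi 6.7):

grep -i "APD" /var/log/vmkernel.log | less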

Make sure the Splunk Add-on for VMware (https://splunkbase.splunk.com/app/3215/) is installed in Splunk; it is free of cost and installs the VMware sourcetype parsers.

1. Click on Event Action > Extract Fields to start the wizard

2. Select Regular Expression > highlight to select a value > name the field > continue on to validation and complete the wizard.

When you click the extracted field, it will show all the events containing the word you highlighted.

Useful Links:

https://splunkbase.splunk.com/app/3975/


Bug noticed in the VCSA build 14367737 syslog configuration.

We are running VCSA build 14367737 and cannot upgrade it because our internal cloud stack sits on top of the vCenter and supports only this VCSA version. I tried forwarding the VCSA logs to a syslog server (Splunk) and noticed that after the configuration it worked for a few hours and then stopped; we had to restart the service manually (systemctl restart rsyslog) to forward the logs to the Splunk server again.

After trying a few options, we upgraded the VC in our test environment to different versions and found the issue was fixed in vCenter Appliance 6.7 Update 3g (6.7.0.44000), build 16046470. Even though the release notes do not mention this issue, it looks like the rsyslog version was upgraded in that VCSA build.

As a workaround, we can configure a cron job to restart the service every two hours.
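A minimal sketch of such a cron entry on the VCSA (the two-hour schedule is just what we chose; adjust as needed):

# crontab -e   (add the line below as root)
0 */2 * * * /usr/bin/systemctl restart rsyslog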


Packet drop issue on HP Gen 9 / Gen 10 servers running ESXi 6.7.

We noticed packet drops on all of our HP BL460c Gen 9 / Gen 10 blades across the region running ESXi 6.7.0 build 16316930. The network adapter installed in these servers is the HPE FlexFabric 10Gb 2-port 536FLB adapter, which is a QLogic adapter.
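To confirm the drops and the driver in use on a host, the standard esxcli commands are enough (vmnic0 here is just an example uplink name):

[root@esx106:~] esxcli network nic stats get -n vmnic0
[root@esx106:~] esxcli software vib list | grep qfle3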

The driver that ships with the HP custom image is qfle3 version 1.1.6.0-1OEM.650.0.0.4598673. We tried updating the driver and the firmware of the HP enclosure (OA and Virtual Connect) to the versions below, but that did not fix the issue.

OA Firmware : 4.96

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_8e583ffa28874a53aa272b959b

We also upgraded the Virtual Connect firmware on both switches, one at a time:

HP Virtual Connect Firmware: 4.85

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_f99f0bc5bfc4414aac021f81af#tab3

Solution:

After we had tried a lot of options, HP recommended installing the driver version below, and the packet drop issue was fixed.

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_fca9a16a601345919247b0c240#tab-history.

[root@esx106:~] esxcli software vib install -v “/cp039955/QLogic-Network-iSCSI-FCoE-v2.0.102-14793946/QLogic-Network-iSCSI-FCoE-v2.0.102-offline_bundle-14793946/vib20/qfle3/QLC_bootbank_qfle3_1.0.87.0-1OEM.670.0.0.8169922.vib”

Installation Result

   Message: Host is not changed. Reboot is pending from previous transaction.

   Reboot Required: true

   VIBs Installed:

   VIBs Removed:

   VIBs Skipped: QLC_bootbank_qfle3_1.0.87.0-1OEM.670.0.0.8169922

As per the ESXi patch advisory, the inbox qfle3 driver is 1.0.50.11-9vmw.670.0.0.8169922, so we needed something close to that version; installing the driver recommended by HP, 1.0.87.0-1OEM.670.0.0.8169922, fixed the issue for us.


Memories of 2020

We started migrating our Tier-1 infrastructure to AWS; most of our internal applications have been moved to the cloud, and along the way I learned a lot of new AWS services and technologies.

Around September, I was asked to support our internal cloud team, which runs CloudStack with VMware, so after a very long gap I was back to VMware and virtualization technology. Initially the changeover was difficult, but now I am very much back on track.

After moving to the internal cloud team, I got the opportunity to take on the Scrum Master role; I have started doing it and plan to finish the certification.

Even though the pandemic brought a lot of challenges, I had a good 2020 in my professional life and am looking forward to the new year 2021.


Easy way to uninstall the Trend Deep Security agent.

I was looking for an easy way to uninstall the Trend agent on Windows 10 and found the commands below useful.

Get-Package -Name  “Trend Micro Deep Security Agent” | Uninstall-Package

Or

msiexec.exe /x <exact MSI package name>.msi /quiet

Reference:

https://success.trendmicro.com/solution/1055096-performing-silent-uninstallation-of-deep-security-agent-dsa-from-windows-machine


AWS Compute related updates

AWS End of Support Migration Program for Windows Server now available as a self-serve solution for customers

Resource Access Manager Support is now available on AWS Outposts

New course for Amazon Elastic Kubernetes Service

Amazon EKS now supports Kubernetes version 1.18

AWS Lambda Extensions: a new way to integrate Lambda with operational tools

AWS Compute Optimizer enhances EC2 instance type recommendations with Amazon EBS metrics

Amazon EBS CSI driver now supports AWS Outposts

Amazon ElastiCache on Outposts is now available

AWS Elastic Beanstalk Adds Support for Running Multi-Container Applications on AL2 based Docker Platform

AWS Batch introduces tag-based access control

Amazon EC2 G4dn Bare Metal Instances with NVIDIA T4 Tensor Core GPUs, now available in 15 additional regions

AWS Launch Wizard now supports SAP HANA backups with AWS Backint Agent


Steps to blacklist the problematic DCs in VMware VCSA 6.7U3

We had a DNS issue on one of the DCs running Active Directory-integrated DNS, and it caused our vCenter to fail to connect to the AD domain. We changed the DNS servers to IPs that were working properly, but AD authentication still failed, and /var/log/messages showed the vCenter still pointing to the problematic DC and failing to authenticate.

After some research, I found instructions in the VCSA 6.7 U3b release notes about the steps to blacklist DCs, and added the problematic DC IP as mentioned below.

Active Directory authentication or joining a domain is slow

Active Directory authentication or joining a domain might be slow when configured with Integrated Windows Authentication (IWA), because of infrastructure issues such as network latency and firewalls in some of the domain controllers.

This issue is resolved in this release. The fix provides the option to blacklist selected domain controllers in case of infrastructure issues.

To set the option, use the following commands:
# /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\netlogon\Parameters]' BlacklistedDCs DC_IP1,DC_IP2,...
# /opt/likewise/bin/lwsm restart lwreg

To revert to the default settings, use the following commands:
# /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\netlogon\Parameters]' BlacklistedDCs ""
# /opt/likewise/bin/lwsm restart lwreg
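To confirm the value took effect, you can list the values under the same key (assuming lwregshell's list_values sub-command, which ships with the same likewise tooling):

# /opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\netlogon\Parameters]'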

But we still noticed the VC connecting to the problematic DC, and /var/lib/likewise/krb5-affinity.conf still showed the problematic DC IP; when we tried to change it manually, it automatically reverted to the old problematic DC IP.

After more research we added the VC's subnet to the new DCs in Active Directory Sites and Services, waited a few minutes, and saw the new DC IPs appear in krb5-affinity.conf. The issue was fixed by pointing the VC to the correct DC and ignoring the problematic one.

Note: The BlacklistedDCs option works only from the 6.7 U3b version onwards.

Useful links:

https://kb.vmware.com/s/article/2127213

https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.psc.doc/GUID-8C553435-27CD-4410-ACA9-9A84EA1D7334.html

https://kb.vmware.com/s/article/53698

https://docs.vmware.com/en/VMware-vSphere/6.7/rn/vsphere-vcenter-server-67u3b-release-notes.html
