We have started migrating our Tier-1 infrastructure to AWS and most of our internal applications have been moved to the cloud and learned lot of new AWS services and new technologies.
Around SEP, I was requested to support our internal cloud team which is running cloud-stack with VMware and after a very big gap again back to VMware and virtualization technology. Initially it was very difficult for the change over but now very much back in to the track.
After moving to the internal cloud team , I got the opportunity to take care of the Scrum master role and started doing the same and planning to finish the certification.
Even tough because of the pandemic there were lot of challenges , I had a good 2020 in my professional life and looking forward for the new year 2021..
We had a DNS issue in one of the DC running active directory integrated DNS service and it caused our vCenter to fail to connect the domain in AD so we have changed the DNS to the IPs which is working properly but identified still AD authentication getting failed and in the VAR\LOG\Messages it was still pointing to the problematic DC and failing to authenticate.
After a few research got the instruction from the VCSA6.7 U3b release notes about the steps to blacklist the DCs and added the problematic DC IP as mentioned below.
Active Directory authentication or joining a domain is slow
Active Directory authentication or joining a domain might be slow when configured with Integrated Windows Authentication (IWA), because of infrastructure issues such as network latency and firewalls in some of the domain controllers.
This issue is resolved in this release. The fix provides the option to blacklist selected domain controllers in case of infrastructure issues.
To set the option, use the following commands: # /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\netlogon\Parameters]' BlacklistedDCs DC_IP1,DC_IP2,...
# /opt/likewise/bin/lwsm restart lwreg
To revert to the default settings, use the following commands: # /opt/likewise/bin/lwregshell set_value '[HKEY_THIS_MACHINE\Services\netlogon\Parameters]' BlacklistedDCs ""
# /opt/likewise/bin/lwsm restart lwreg
But still we noticed the VC is connecting to the problematic DC and also in the file /var/lib/likewise/krb5-affinity.conf it was showing the problamatic DC IP and when we tried to change it manually , automatically it got updated to the old problamatic DC IP .
After research we added the VC subnet in Active Directory Sites and Services to the new DCs and waited for few mins and noticed in the krb5-affinity.conf the new DC IPs got updated and issue got fixed by pointing the VC to the correct DC and ignoring the problematic DC.
Note : BlacklistDCs will work only from the 6.7U3b version.
Recently had the chance to learn about the CFN-LINT tool which is a very useful tool to validate the CloudFormation template directly using the editor and it makes us create the CF without any error and secured.
We have a vcenter environment with around 500 ESXi hosts running on multiple clusters and for the past several weeks we had the issue of Vcetner down because of VPXD crash and the service will be in stopped status.
VMware support identified the VCDB growth is huge with high CPU and memory usage on the vCenter and they started to investigate the same.
[Analysis]
Most of the memory usage is contributed by the Events
Signature 7f46978acb30 (Vmomi::KeyAnyValue) has 81430551 instances taking 0xd0aaf2b8(3,500,864,184) bytes.
Signature 562dc2a74a90 (Vmomi::Primitive<std::string>) has 55823822 instances taking 0x50096420(1,342,792,736) bytes.
Signature 7f4696eb77d0 (Vim::Event::ManagedEntityEventArgument) has 21574134 instances taking 0x37913be0(932,264,928) bytes.
Signature 7f4696eb7870 (Vim::Event::DatacenterEventArgument) has 13307524 instances taking 0x2205bdc0(570,801,600) bytes.
Signature 7f4696eb7960 (Vim::Event::HostEventArgument) has 13285493 instances taking 0x21b98d98(565,808,536) bytes.
Signature 7f4696eb78c0 (Vim::Event::ComputeResourceEventArgument) has 13280205 instances taking 0x21ddefb8(568,192,952) bytes.
Signature 562dc2a831f0 (Vmomi::DataArray<Vmomi::KeyAnyValue>) has 13086149 instances taking 0x21532ee8(559,099,624) bytes.
Signature 7f4696eb7d20 (Vim::Event::EventEx) has 13084710 instances taking 0xc31d4ee0(3,273,477,856) bytes.
Signature 7f4696eb7aa0 (Vim::Event::AlarmEventArgument) has 12934883 instances taking 0x20af9168(548,376,936) bytes.
Signature 562dc2ae8730 (Vmomi::Primitive<int>) has 12863119 instances taking 0x12707908(309,360,904) bytes.
Signature 7f4696eb4080 (Vim::Event::AlarmActionTriggeredEvent) has 4301050 instances taking 0x2b89b560(730,445,152) bytes.
Signature 7f4696eb4170 (Vim::Event::AlarmSnmpCompletedEvent) has 4301049 instances taking 0x272dbf98(657,309,592) bytes.
VCDB=# SELECT nspname || ‘.’ || relname AS “relation”, pg_size_pretty(pg_total_relation_size(C.oid)) AS “total_size” FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname NOT IN (‘pg_catalog’, ‘information_schema’) AND C.relkind <> ‘i’ AND nspname !~ ‘^pg_toast’ ORDER BY pg_total_relation_size(C.oid) DESC LIMIT 20;
As per VMware it is known issue with 6.7 causes the burst of events causing the high IO on the vCenter.
You could update all the ESXi to 6.7 P01 or the latest version to fix the issue or follow the workaround mentioned in the KB.
We have migrated one of our EC2 instance from T2 to T3 and found there is a lack of performance on the server and when we had a call with AWS support they shared that the T3 type is based on the nitro hardware and it required drivers to support a nitro based hardware.
Downloaded and updated the network (ENA) and Storage (NVMe) along with the PV drivers on that instance.
I took the image from the parent instance in which the key is working by using the .pem file but the instance which was created from the image the local administrator password is not working and it is failing with the below error.
After the investigation, I found that whenever we launch a new Windows instance using an Amazon-provided AMI, EC2Launch service is configured to generate a random password from the console.
However, after we launch the instance, this setting will be disabled on EC2Launch and you will need to enable it before creating a custom AMI. If this setting is not enabled on EC2Launch before creating the AMI, you won’t be able to retrieve the Password from the console and the same password of the source/parent instance would need to be used to access the new instance. Ideally, we should be able to login to this instance using the local Administrator password which was captured from the parent instance during the creation of the AMI.
As we are not able to login using the local administrator password of the parent AMI, we can use EC2-rescue tool to set random password and retrieve random generated password from the EC2 console using Key-pair for this instance.
Below Steps were shared from the Amazon support:
=========
Please follow steps to use EC2 Rescue to set password:
To troubleshoot this issue we used EC2Rescue tool and followed below steps:
[1] Launch helper instance from a Windows AWS Public AMI in the same VPC and subnet in which your current instance is launched.
[2] Detach the root volume of the instance and attach it to the helper instance as the secondary volume.
[3] Now, login to your helper Instance via RDP.
[4] Please download EC2Rescue tool by using the below link on the helper instance.
[5] Open EC2Rescue application -> Select the offline drive -> Click Diagnose and Rescue -> Select Ec2SetPassword
[6] Next, then Rescue, and OK for the volume to be offline.
[7] Once EC2Rescue has completed, detach the volume ( vol-0404dea9693705b73 ) from the helper instance and re-attach the volume back to the original instance ( i-0a41a2b8f29b4ec25 ) as /dev/sda1.
[8] Start the instance and Next, retrieve the password from the EC2 Console using your key-pair ( Metasys ) and connect to the instance.
Below links which have video included for the same steps.
Going ahead, I suggest that you SysPrep an instance before creating its AMI so that password access is enabled and you are able to retrieve console generated password.
Amazon FSx for Windows File Server provides a fully managed native Microsoft Windows file system. Amazon FSx provides NTFS file systems that can be accessed from up to thousands of compute instances using the SMB protocol (SMB 2.0 to 3.1.1). You can access your Amazon FSx file system from Amazon EC2, VMware Cloud on AWS, Amazon WorkSpaces, and Amazon AppStream 2.0 instances.
The service works with Microsoft Active Directory (AD) to integrate your file system with your existing Windows environments. Amazon FSx uses Distributed File System (DFS) Replication to support multi-AZ deployments. To ensure compatibility with your applications, Amazon FSx supports all Windows versions starting from Windows Server 2008 and Windows 7, and also current versions of Linux.
AWS FSx can be used for various application workloads like home directories, web serving & content management, enterprise applications, analytics, and media & entertainment. All Amazon FSx file system data is automatically encrypted at rest and in transit.
1.0 FSx for Windows file system
Prerequisites – Currently AWS FSx only work with AWS Managed AD [AWS MAD], AWS support AWS Directory Service’s AD sharing feature, AWS plan to support AD Connector and Self-managed Microsoft Active Directory.
These are the Directory Types options AWS have –
a) AWS Managed Microsoft AD
b) Simple AD
c) AD Connector
d) Cloud Directory
e) Amazon Cognito Your User Pools
No need to select any option from above.Self-managed Microsoft Active Directory is something we need to use for our use cases. Other options are little complicated and has some overhead to it to maintain as explained below.
a) Create AWS Managed Active Directory
We need to create AD (eg: xyz.storage.com). During AWS Managed Directory service creation, you will be asked for – Directory Type; Edition; Directory DNS name; Directory NetBIOS name; VPC; Subnet; AZ.
Limitation:
Here we need to maintain AD in respective account. So, this AD need a continues sync for latest AD objects.
Details explained here for the filesystem creation for this method
b) Simple AD
Simple AD is a standalone managed directory that is powered by a Linux-Samba Active Directory–compatible server. Not recommended as it needs to create a directory and object limitation
b) AD Connector
Need to create a trust relationship between source (app account ) and destination account (AD account).
–> Need to create AWS Managed AD directory ID
Limitation:
1) Overhead for this process and every account using this will need to have this trust relation.
2) Cost for active directory service created.
d) Cloud Directory : Similar to RDS cloud directory, Cloud Directory is a high-performance, serverless, hierarchical data store still same limitations as above mentioned for Simple AD.
e) Amazon Cognito Your User Pools : Directs to Cognito service for directory creation. No use
Share folder will comes by default and cannot be deleted.
2.0 Automatic Daily Backups
Amazon FSx automatically takes backups of your file systems once a day. These daily backups are taken during the daily backup window that was established when you created the file system. At some point during the daily backup window, storage I/O might be suspended briefly while the backup process initializes (typically under a few seconds). When you choose your daily backup window, we recommend that you choose a convenient time of the day outside of the normal operating hours for the applications that will use the file system. Backups are kept for a certain period of time, known as a retention period. By default, backups are retained for 7 days. However, you can change the retention period to anywhere in a range of 0–35 days.
We can perform backup creation and restoration from FSx Management Console, the AWS CLI, or one of the AWS SDKs
3.0 Multi-AZ File System Deployments
For workloads that require multi-AZ redundancy to tolerate temporary AZ unavailability, We can create multiple file systems in separate AZs, keep them in sync, and configure failover between them. Amazon FSx fully supports the use of the Microsoft Distributed File System (DFS) for file system deployments across multiple AZs to get Multi-AZ availability and durability. Using DFS Replication, you can automatically replicate data between two file systems. Using DFS Namespaces, you can configure one file system as your primary and the other as your standby, with automatic failover to the standby in the event that the primary becomes unresponsive. MS DFS support both async and sync replication.
AWS FSx provides high availability and failover support across multiple AZs which can be used for shared storage and also as mapped drive instead of EBS volumes as EBS cannot span Multi-AZ.
4.0 Benefits and Cons of FSx
Benefits:
· AWS FSx is fully managed. It relies on SSD storage and performs with high levels of IOPS and throughput, as well as consistent sub-millisecond latencies for a well-designed infra.
· AWS FSx is secure. All of the file systems are a part of the Virtual Private Cloud (VPC); all data is encrypted both in transit and at rest, and all activities are logged to CloudTrail
Cons:
· AWS FSx for windows File server supports custo DNS only Single-AZ filesystems, not for Multi-AZ as if yet