NFS 4.1 datastore disconnected after the NetApp storage upgrade/failover.

Update Jun 2024: It looks like the issue has been fixed in Update 3f; please check this blog post.

We have both NFS 3 and NFS 4.1 datastores presented to these ESXi hosts, all from the same NetApp storage. As part of a NetApp storage upgrade, a failover was performed on the storage end and the hosts experienced some APD (All Paths Down) errors. We observed issues only with the NFS 4.1 datastores; the NFS 3 datastores stayed connected without any issue.

Error on the host side:

cpu0:2098542)StorageApdHandlerEv: 110: Device or filesystem with identifier [] has entered the All Paths Down state.
2023-06-10T10:52:58.189Z cpu0:)StorageApdHandlerEv: 110: Device or filesystem with identifier [] has entered the All Paths Down state.
2023-06-10T10:55:18.193Z cpu0:2098542)StorageApdHandlerEv: 126: Device or filesystem with identifier [] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will no$

We involved NetApp to check the storage logs, and below are the findings.

When a LIF is migrated to another node, the storage sends a TCP RST to close the existing connection and a GARP (gratuitous ARP) to update the MAC address of the port the LIF has moved to, so the client can initiate a new connection to the LIF over that MAC.
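For reference, this RST + GARP behavior can be spotted in a packet capture taken on the NFS vmkernel network during the LIF migration. Below is a minimal sketch using the Python scapy library; the capture file name is just a placeholder.

    from scapy.all import rdpcap, ARP, IP, TCP  # requires scapy

    # "lif_failover.pcap" is a hypothetical capture taken during the LIF move.
    for pkt in rdpcap("lif_failover.pcap"):
        # Gratuitous ARP: sender and target protocol address are the same.
        if pkt.haslayer(ARP) and pkt[ARP].psrc == pkt[ARP].pdst:
            print(f"GARP: {pkt[ARP].psrc} is now at {pkt[ARP].hwsrc}")
        # TCP reset sent by the storage to tear down the old NFS connection.
        elif pkt.haslayer(TCP) and pkt[TCP].flags & 0x04 and pkt.haslayer(IP):
            print(f"RST : {pkt[IP].src}:{pkt[TCP].sport} -> {pkt[IP].dst}:{pkt[TCP].dport}")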

Trace review

– The storage sent an RST to close the connection and a GARP to advertise the new MAC.

– The client used the new MAC and initiated a TCP connection to the storage, which succeeded.

– The client sent a BIND_CONN_TO_SESSION call, and the storage responded with NFS4ERR_BADSESSION. Once that error is received, the client should send an EXCHANGE_ID and initiate a new session with the storage.

– When the client sent the EXCHANGE_ID, the storage responded with NFS4ERR_DELAY. On receiving NFS4ERR_DELAY, the client retries the same operation.

– Later, we could see the EXCHANGE_ID operation answered successfully by the storage, and the client acknowledged the response.

– After that, the client did not send any calls to initiate the session and only sent TCP keep-alives. The same behavior was seen when the LIF was reverted back to the node. The traces clearly show that, after acknowledging the response, the client was not able to send any further calls to initiate the session (the expected sequence is sketched below).
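To make the expected sequence concrete, here is a small, illustrative Python sketch of the recovery flow described above. The operation names and error codes are from RFC 5661, but the RPC layer is faked with canned replies that mirror what the trace showed; this is not the actual ESXi client code.

    import time

    # Error codes from RFC 5661.
    NFS4_OK = 0
    NFS4ERR_DELAY = 10008
    NFS4ERR_BADSESSION = 10052

    # Canned server replies mirroring the trace after the LIF move:
    # BIND_CONN_TO_SESSION -> BADSESSION, EXCHANGE_ID -> DELAY then OK,
    # and finally CREATE_SESSION -> OK (the step the ESXi client never sent).
    _replies = iter([NFS4ERR_BADSESSION, NFS4ERR_DELAY, NFS4_OK, NFS4_OK])

    def send_compound(op):
        """Stand-in for the client's RPC layer; returns the next canned status."""
        status = next(_replies)
        print(f"{op:<22} -> {status}")
        return status

    def recover_session():
        """Recovery sequence RFC 5661 expects after the connection reset."""
        if send_compound("BIND_CONN_TO_SESSION") == NFS4ERR_BADSESSION:
            # The session is gone on the server: renegotiate with EXCHANGE_ID,
            # retrying while the server answers NFS4ERR_DELAY.
            while send_compound("EXCHANGE_ID") == NFS4ERR_DELAY:
                time.sleep(1)
            # In the traces the ESXi client stalled here: it acknowledged the
            # EXCHANGE_ID reply but never issued CREATE_SESSION.
            send_compound("CREATE_SESSION")

    recover_session()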

VMware was involved to check the logs further, based on NetApp's input.

From the logs, once the NFS 4.1 client gets the EXCHANGE_ID response from the server, it compares the cluster roles. The NFS 4.1 client was expecting 393 but got 655, so it bails out there.

2023-06-21T14:27:44.755Z cpu4:2099419)WARNING: NFS41: NFS41ExidNFSProcess:2054: EXCHANGE_ID error: NFS4ERR_DELAY

As per VMware, it looks like the NFS server role changed from 0x60**** before the upgrade to 0x10*** after the upgrade. Vmkernel.log doesn't have the original server entries showing the role before the upgrade; they may have rolled over. If the server role changes, the NFS 4.1 client doesn't re-initiate the sessions; it expects the datastores to be unmounted and remounted. In the captured logs prior to the upgrade, the role flags were 0x00060***, which is (EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS); after the upgrade, the role flags were 0x00010****, which is EXCHGID4_FLAG_USE_NON_PNFS.
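For context, those role bits are the EXCHGID4 flag values defined in RFC 5661 (section 18.35). A quick Python snippet to decode them, using 0x00060000 and 0x00010000 as stand-ins for the partially masked values in the logs:

    # EXCHGID4 role flags as defined in RFC 5661, section 18.35.
    EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
    EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
    EXCHGID4_FLAG_USE_PNFS_DS  = 0x00040000

    def decode_roles(flags):
        names = {
            EXCHGID4_FLAG_USE_NON_PNFS: "USE_NON_PNFS",
            EXCHGID4_FLAG_USE_PNFS_MDS: "USE_PNFS_MDS",
            EXCHGID4_FLAG_USE_PNFS_DS:  "USE_PNFS_DS",
        }
        return [name for bit, name in names.items() if flags & bit]

    print(decode_roles(0x00060000))  # before upgrade: ['USE_PNFS_MDS', 'USE_PNFS_DS']
    print(decode_roles(0x00010000))  # after upgrade:  ['USE_NON_PNFS']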

On further investigation, VMware explained the NFS 4.1 client behavior with respect to a server role change:

– The pNFS setting was changed at the NetApp server/volume level (from _USE_PNFS_MDS|_USE_PNFS_DS to USE_NON_PNFS).
– Later, there was a NetApp server switchover as part of the NetApp server upgrade, and some of the ESXi hosts stopped the session continuation after the new connection came up.

Per RFC 5661 Section 13.1, the server roles are agreed upon during the EXCHANGE_ID operation; after this exchange, if the server changes the roles, there is no protocol operation to inform the client of the role change. This role mismatch can be a problem because the client assumes the older role while the server assumes the newer one.

In this case, after the NetApp server upgrade the connection reset happened, and when the ESXi client re-established the connection it got a new role in the EXCHANGE_ID for the existing datastore. The current client implementation doesn't support a role change for already-mounted datastores, as this involves releasing the old resources and setting up new ones, so it rejected the new role and stopped progressing with the session establishment.
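Put differently, the check the client effectively makes looks something like the hypothetical sketch below. This is only a reconstruction of the behavior described above, not VMware's actual code:

    # pNFS-related role bits from RFC 5661.
    PNFS_ROLE_MASK = 0x00070000  # USE_NON_PNFS | USE_PNFS_MDS | USE_PNFS_DS

    def accept_new_roles(mounted_roles, advertised_roles):
        """Return True if session re-establishment may proceed for a mounted datastore."""
        if (mounted_roles & PNFS_ROLE_MASK) != (advertised_roles & PNFS_ROLE_MASK):
            # The mount was set up under the old role; switching would mean
            # tearing down and rebuilding resources, which the client does not
            # do for an already-mounted datastore, so it bails out here.
            return False
        return True

    print(accept_new_roles(0x00060000, 0x00010000))  # False -> session setup stalls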

Since there is no protocol support for a dynamic change of the server role, the VMware-recommended method is to unmount and remount the datastore after the server role changes. This sets up the appropriate context based on the new server role, so subsequent NetApp server upgrades will be seamless and non-disruptive.
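In practice, the unmount/remount can be scripted from the ESXi shell, where both Python and esxcli are available. A rough sketch is below; the datastore name, LIF addresses and export path are made-up examples, and the datastore must have no running VMs before it is removed:

    import subprocess

    # Hypothetical example values - replace with the real datastore name,
    # NFS 4.1 LIF addresses and export path.
    DATASTORE = "nfs41_ds01"
    SERVERS = "192.168.10.11,192.168.10.12"
    SHARE = "/vol/nfs41_ds01"

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Unmount the datastore so the client drops the old role context...
    run(["esxcli", "storage", "nfs41", "remove", "-v", DATASTORE])
    # ...then remount it so a fresh EXCHANGE_ID picks up the new server role.
    run(["esxcli", "storage", "nfs41", "add", "-H", SERVERS, "-s", SHARE, "-v", DATASTORE])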

The only workaround is to reboot the host, which re-initiates the connection and reconnects the datastore to the host.
