Hi guys, I have just been reading the May 2018 patch notes for WS2016 and noticed these two points:
Improves resiliency in handling network issues that may cause highly available VMs to be turned off because of I/O timeouts or Cluster Shared Volumes dismounted messages.
Addresses an issue that causes the Drain Manager Cluster service to sometimes be stuck in the draining state.
Ever since a network storm, we have been having issues whenever we drain nodes or the Cluster Volumes move between nodes. These include the disks getting stuck in "Online Pending" for close to 5 minutes, which kills the virtual machines running on the affected volume.
We have installed the latest updates on the nodes (bar the May 2018 patches) and updated the firmware on the SAN and the SAN disks. When we did this, though, we ended up with a few corrupt Cluster Volumes, which we managed to recover using chkdsk. Even after the chkdsk, the volumes took over 5 minutes to go from "Online Pending" to fully online.