A node can go down for several reasons. The scenarios below cover the most probable causes of nodes going down in a Service Fabric cluster.
Scenario#1:
Check whether the virtual machine associated with the node still exists, or whether it has been deleted or deallocated.
Go to Azure Portal -> VMSS Resource -> Instances
If the virtual machine doesn’t exist, remove the node state from the Service Fabric cluster using either of the options below.
From SFX:
From PS Command:
PS cmd: Remove-ServiceFabricNodeState -NodeName _node_5 -Force
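The check-and-remove flow above can be sketched in PowerShell. This is a minimal sketch, assuming the Service Fabric PowerShell module is installed and a cluster connection has already been made with Connect-ServiceFabricCluster. The helper name `Select-DownNode` is hypothetical (not part of any module), and `_node_5` is simply the node name from the example above. Node state should only be removed once you have confirmed the VM is permanently gone.

```powershell
# Hypothetical helper (not part of the Service Fabric module):
# passes through only the nodes whose NodeStatus is 'Down'.
function Select-DownNode {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $Node
    )
    process {
        # NodeStatus is an enum on real node objects, so compare its string form.
        if ("$($Node.NodeStatus)" -eq 'Down') { $Node }
    }
}

# Usage against a live cluster (assumes an existing cluster connection):
#   Get-ServiceFabricNode | Select-DownNode | Select-Object NodeName, NodeStatus
#
# After confirming in the portal that the VM behind the node no longer exists:
#   Remove-ServiceFabricNodeState -NodeName _node_5 -Force
```

Listing the down nodes first makes it easier to match each one against the VMSS instance list before removing any state.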
Scenario#2:
Check whether the virtual machine associated with the node is healthy in the VMSS.
Go to Azure Portal -> VMSS Resource -> Instances -> Click on the Instance -> Properties
If the Virtual Machine Guest Agent status is “Not Ready”, reach out to the Azure VM team for a root cause analysis (RCA).
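The same guest agent check can be scripted instead of using the portal. The sketch below assumes the Az.Compute module and a signed-in Azure session. The helper `Test-VmAgentReady` is hypothetical, the resource names are placeholders, and the property shape (`VmAgent.Statuses[].DisplayStatus`) is an assumption about the instance view returned by `Get-AzVmssVM -InstanceView` that you should verify in your environment.

```powershell
# Hypothetical helper: returns $true only when the instance view reports
# at least one VM agent status and every one of them is "Ready".
function Test-VmAgentReady {
    param($InstanceView)

    # Drop null entries so a missing VmAgent section counts as "no status".
    $statuses = @($InstanceView.VmAgent.Statuses | Where-Object { $_ })
    if ($statuses.Count -eq 0) { return $false }

    foreach ($status in $statuses) {
        if ($status.DisplayStatus -ne 'Ready') { return $false }
    }
    return $true
}

# Usage (placeholder resource group, scale set, and instance id):
#   $view = Get-AzVmssVM -ResourceGroupName 'my-rg' -VMScaleSetName 'my-vmss' -InstanceId '5' -InstanceView
#   Test-VmAgentReady -InstanceView $view
```

A `$false` result corresponds to the “Not Ready” case described above.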
Possible Mitigation:
Scenario#3:
Check the performance of the virtual machine, such as CPU and memory usage.
If CPU or memory usage is high, the Fabric-related processes may not be able to establish or start their instances, which causes the node to go down.
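A quick way to take these readings on the VM itself is with Windows performance counters. The sketch below is illustrative only: the helper `Test-ResourcePressure` is hypothetical, and its default thresholds (90% CPU, 512 MB available memory) are assumptions chosen for the example, not Service Fabric guidance.

```powershell
# Hypothetical helper: flags resource pressure from one CPU and one memory reading.
# Default thresholds are illustrative assumptions; tune them for your VM size.
function Test-ResourcePressure {
    param(
        [double]$CpuPercent,
        [double]$AvailableMB,
        [double]$CpuThreshold = 90,
        [double]$MinAvailableMB = 512
    )
    $underPressure = ($CpuPercent -ge $CpuThreshold) -or ($AvailableMB -le $MinAvailableMB)
    return $underPressure
}

# Usage on the affected VM (Windows only; counter names assume an English locale):
#   $samples = (Get-Counter -Counter '\Processor(_Total)\% Processor Time', '\Memory\Available MBytes').CounterSamples
#   $cpu = ($samples | Where-Object Path -like '*processor time').CookedValue
#   $mem = ($samples | Where-Object Path -like '*available mbytes').CookedValue
#   Test-ResourcePressure -CpuPercent $cpu -AvailableMB $mem
#
# To see which processes hold the most memory:
#   Get-Process | Sort-Object WorkingSet64 -Descending | Select-Object -First 5 Name, Id, WorkingSet64
```

A single sample can be misleading, so it is worth repeating the reading a few times before concluding the VM is under sustained pressure.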
Mitigation:
Collect process dumps using one of the tools below to determine the root cause:
DebugDiag:
Download Debug Diagnostic Tool v2 Update 3 from Official Microsoft Download Center
(or) Procdump:
ProcDump - Windows Sysinternals | Microsoft Docs
Scenario#4:
Check the disk usage of the virtual machine. Running out of disk space can lead to node down issues.
For disk space issues, we recommend using the ‘windirstat’ tool mentioned in the article https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Cluster/Out%20of%20Disksp... to understand which folders are consuming the most space.
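If installing a tool on the VM isn't convenient, a rough per-folder breakdown can also be obtained with built-in PowerShell cmdlets. This is a minimal sketch rather than a replacement for the tool above. The helper `Get-FolderSize` is hypothetical, and the `D:\` path in the usage lines is only an example.

```powershell
# Hypothetical helper: reports the total file size (in MB) under each
# immediate subfolder of $Path, largest first. Unreadable items are skipped.
function Get-FolderSize {
    param([string]$Path)

    Get-ChildItem -LiteralPath $Path -Directory -Force -ErrorAction SilentlyContinue | ForEach-Object {
        $bytes = (Get-ChildItem -LiteralPath $_.FullName -Recurse -File -Force -ErrorAction SilentlyContinue |
                  Measure-Object -Property Length -Sum).Sum
        [pscustomobject]@{
            Folder = $_.FullName
            SizeMB = [math]::Round(([double]$bytes) / 1MB, 2)   # empty folders yield 0
        }
    } | Sort-Object SizeMB -Descending
}

# Usage on the affected VM (drive letter is an example):
#   Get-PSDrive -PSProvider FileSystem | Select-Object Name, Used, Free
#   Get-FolderSize -Path 'D:\' | Select-Object -First 10
```

Because it walks every file, the helper can be slow on large disks. Running it one level at a time on the biggest folder keeps each pass short.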
Mitigation:
Free up disk space to bring the node back up.