Troubleshooting Node down Scenarios in Azure Service Fabric - Part I
Published May 05 2021 12:09 AM
Microsoft

A node can go down for several reasons. The scenarios below cover the most probable causes of nodes going down in a Service Fabric cluster.

 

Scenario#1:

Check whether the virtual machine associated with the node still exists, or whether it has been deleted or deallocated.

Azure Portal-> VMSS Resource -> Instances


If the virtual machine doesn’t exist, remove the node state from the Service Fabric cluster using either of the methods below.

From Service Fabric Explorer (SFX):

  • Go to the Service Fabric Explorer of the cluster.
  • Select the Advanced mode setting check box for the cluster.
  • Click the ellipsis (…) next to the down node and choose “Remove node state”. This removes the node state from the cluster.

 

From PowerShell:

Remove-ServiceFabricNodeState -NodeName _node_5 -Force

Reference: https://docs.microsoft.com/en-us/powershell/module/servicefabric/remove-servicefabricnodestate?view=...
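Before removing node state, it can help to list the nodes the cluster currently reports as down. The following is a minimal sketch, assuming the Service Fabric PowerShell module is available and Connect-ServiceFabricCluster has already been run in the session. Get-DownNode is a hypothetical helper name, not part of the module.

```powershell
# Assumption: Connect-ServiceFabricCluster has already been run in this session.
# Get-DownNode is a hypothetical helper name used only for this sketch.
function Get-DownNode {
    # Return only the nodes that Service Fabric currently reports as Down.
    Get-ServiceFabricNode | Where-Object { $_.NodeStatus -eq 'Down' }
}

# Review the list first:
#   Get-DownNode | Format-Table NodeName, NodeStatus, IpAddressOrFQDN
# Then remove state only for a node whose VM is confirmed to be gone:
#   Remove-ServiceFabricNodeState -NodeName '_node_5' -Force
```

Keeping the removal as a separate, manual step avoids removing state for a node whose virtual machine still exists.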

 

Scenario#2:

Check whether the virtual machine associated with the node is healthy in the VMSS.

Go to Azure Portal-> VMSS Resource -> Instances -> Click on the Instance -> Properties


If the Virtual Machine Guest Agent status is “Not Ready”, reach out to the Azure VM team for a root cause analysis (RCA).

 

Possible Mitigation:

  • Restart the Virtual machine from VMSS blade.
  • Re-image the Virtual Machine.
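The same check and mitigations can also be scripted instead of using the portal. This is a rough sketch, assuming the Az.Compute module is installed and Connect-AzAccount has been run. Get-VmssAgentStatus is a hypothetical helper, and the resource group, scale set name, and instance ID shown are placeholders.

```powershell
# Assumptions: Az.Compute is installed and Connect-AzAccount has been run.
# Get-VmssAgentStatus is a hypothetical helper name used only for this sketch.
function Get-VmssAgentStatus {
    param(
        [Parameter(Mandatory)] [string] $ResourceGroupName,
        [Parameter(Mandatory)] [string] $VMScaleSetName,
        [Parameter(Mandatory)] [string] $InstanceId
    )
    # Fetch the instance view of a single VMSS instance.
    $view = Get-AzVmssVM -ResourceGroupName $ResourceGroupName `
                         -VMScaleSetName $VMScaleSetName `
                         -InstanceId $InstanceId -InstanceView
    # Return the guest agent display status strings (for example "Ready").
    $view.VmAgent.Statuses | ForEach-Object { $_.DisplayStatus }
}

# Usage with placeholder values:
#   Get-VmssAgentStatus -ResourceGroupName 'my-rg' -VMScaleSetName 'my-vmss' -InstanceId '5'
# Mitigation, restart the instance:
#   Restart-AzVmss -ResourceGroupName 'my-rg' -VMScaleSetName 'my-vmss' -InstanceId '5'
# Mitigation, re-image the instance:
#   Set-AzVmssVM -ResourceGroupName 'my-rg' -VMScaleSetName 'my-vmss' -InstanceId '5' -Reimage
```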

 

Scenario#3:

Check the performance of the virtual machine, such as CPU and memory usage.

 

If CPU or memory usage is high, the Fabric-related processes may not be able to establish or start their instances, which causes the node to go down.

 

Mitigation:

  • Check in Task Manager which process is consuming high CPU or memory, then investigate the root cause and fix the issue permanently. Collect dumps with one of the tools below to determine the root cause:

    DebugDiag: Download Debug Diagnostic Tool v2 Update 3 from Official Microsoft Download Center

    ProcDump: ProcDump - Windows Sysinternals | Microsoft Docs


  • Restart the Virtual machine from VMSS blade.
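When a remote PowerShell session is more convenient than Task Manager, the heaviest processes can be listed from the command line. This is a minimal sketch. Get-TopProcess is a hypothetical helper name. Note that the CPU column from Get-Process is cumulative processor time in seconds, not a current percentage.

```powershell
# Get-TopProcess is a hypothetical helper name used only for this sketch.
function Get-TopProcess {
    param(
        [int] $Count = 5,
        [ValidateSet('CPU', 'WorkingSet64')] [string] $By = 'CPU'
    )
    # CPU is cumulative processor seconds; WorkingSet64 is resident memory in bytes.
    Get-Process |
        Sort-Object -Property $By -Descending |
        Select-Object -First $Count -Property Name, Id, CPU, WorkingSet64
}

# Usage:
#   Get-TopProcess -By CPU
#   Get-TopProcess -Count 10 -By WorkingSet64
# Once the offending process ID is known, a full dump can be taken with ProcDump,
# assuming procdump.exe has been downloaded to the machine:
#   procdump.exe -ma <PID>
```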

 

Scenario#4:

Check the disk usage of the virtual machine. A disk with no free space can cause the node to go down.

For disk space issues, we recommend the ‘windirstat’ tool mentioned in the article https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Cluster/Out%20of%20Disksp... to understand which folders are consuming the most space.
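If installing a tool on the node is not an option, a rough equivalent can be sketched in PowerShell. This is only an illustration, not a replacement for the linked guide. Get-FolderSize is a hypothetical helper name, and the path in the usage comment is a placeholder.

```powershell
# Show used and free space per file system drive.
Get-PSDrive -PSProvider FileSystem | Select-Object Name, Used, Free

# Get-FolderSize is a hypothetical helper name used only for this sketch.
function Get-FolderSize {
    param([Parameter(Mandatory)] [string] $Path)
    # Sum file sizes under each immediate subfolder and list the largest first.
    Get-ChildItem -LiteralPath $Path -Directory -Force -ErrorAction SilentlyContinue |
        ForEach-Object {
            $bytes = (Get-ChildItem -LiteralPath $_.FullName -Recurse -File -Force -ErrorAction SilentlyContinue |
                      Measure-Object -Property Length -Sum).Sum
            [pscustomobject]@{
                Folder = $_.FullName
                SizeMB = [math]::Round([double]$bytes / 1MB, 2)
            }
        } |
        Sort-Object -Property SizeMB -Descending
}

# Usage with a placeholder path:
#   Get-FolderSize -Path 'D:\' | Select-Object -First 10
```

Folders that cannot be read are skipped silently, so the reported sizes are a lower bound.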

 

Mitigation:

Free up disk space to bring the node back up.

 
