Azure PaaS Blog

Troubleshooting Node down Scenarios in Azure Service Fabric - Part I

reshmav

Microsoft

May 05, 2021

A node can go down for several reasons. This post walks through the probable causes of nodes going down in a Service Fabric cluster.

Scenario#1:

Check whether the virtual machine associated with the node still exists, or whether it has been deleted or deallocated.

Azure Portal-> VMSS Resource -> Instances
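When several nodes are down, it can help to cross-check the down node names against the instance list shown in the portal. The sketch below is a minimal illustration, not an official tool. It assumes that node names end in `_<instance id>` (as in `_node_5`) and that you have copied the VMSS instance IDs by hand; the function names are hypothetical.

```python
import re


def instance_id_from_node_name(node_name):
    """Return the trailing integer of a node name such as '_node_5', or None.

    Assumption: the node name ends with '_<VMSS instance id>'.
    """
    match = re.search(r"_(\d+)$", node_name)
    return int(match.group(1)) if match else None


def nodes_without_vm(down_nodes, vmss_instance_ids):
    """Return the down nodes whose instance ID is missing from the VMSS list.

    Names that do not follow the assumed pattern are skipped, because
    nothing can be concluded about them from the name alone.
    """
    existing = set(vmss_instance_ids)
    orphaned = []
    for name in down_nodes:
        instance_id = instance_id_from_node_name(name)
        if instance_id is not None and instance_id not in existing:
            orphaned.append(name)
    return orphaned


if __name__ == "__main__":
    # Hypothetical data: instances 0-4 exist, so _node_5 has no backing VM.
    print(nodes_without_vm(["_node_2", "_node_5"], [0, 1, 2, 3, 4]))
```

Any node returned here is only a candidate for having lost its VM; confirm it in the Instances blade before acting on it.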

If the virtual machine doesn’t exist, remove the node state from the Service Fabric cluster using either of the methods below.

From SFX:

Open Service Fabric Explorer (SFX) for the cluster.
Select the Advanced mode check box in the cluster settings.

Then click the ellipsis (…) next to the down node and choose the “Remove node state” option. This should remove the node state from the cluster.

From PowerShell (after connecting to the cluster with Connect-ServiceFabricCluster):

PS cmd: Remove-ServiceFabricNodeState -NodeName _node_5 -Force

Reference: https://docs.microsoft.com/en-us/powershell/module/servicefabric/remove-servicefabricnodestate?view=azureservicefabricps
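If PowerShell is not available, the same operation is, to my understanding, also exposed through the Service Fabric REST API as `POST /Nodes/{nodeName}/$/RemoveNodeState`. The sketch below only builds the request. The endpoint `http://localhost:19080`, the API version `6.0`, and the function names are assumptions for illustration; a secured cluster would also need client certificate handling, which is not shown. Please verify the details against the REST reference before use.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: API version accepted by the cluster's HTTP gateway.
API_VERSION = "6.0"


def remove_node_state_url(endpoint, node_name):
    """Build the RemoveNodeState URL for one node."""
    base = endpoint.rstrip("/")
    encoded_name = quote(node_name, safe="")
    return f"{base}/Nodes/{encoded_name}/$/RemoveNodeState?api-version={API_VERSION}"


def build_remove_node_state_request(endpoint, node_name):
    """Return an unsent POST request; pass it to urllib.request.urlopen to send."""
    return Request(remove_node_state_url(endpoint, node_name), data=b"", method="POST")


if __name__ == "__main__":
    request = build_remove_node_state_request("http://localhost:19080", "_node_5")
    print(request.get_method(), request.full_url)
```

As with the PowerShell command, only send this for a node whose VM is confirmed gone, because removing node state tells the cluster that the node's state is permanently lost.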

Scenario#2:

Check whether the virtual machine associated with the node is healthy in the VMSS.

Go to Azure Portal-> VMSS Resource -> Instances -> Click on the Instance -> Properties

If the Virtual Machine Guest Agent status is “Not Ready”, reach out to the Azure VM team for a root cause analysis (RCA).

Possible Mitigation:

Restart the virtual machine from the VMSS blade.
Re-image the virtual machine.
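The health check in this scenario can also be scripted against an exported instance view. The sketch below is a rough illustration that assumes the instance view has been saved as a Python dict with the usual `vmAgent.statuses[].displayStatus` and `statuses[].code` fields (for example `PowerState/running`); the function names are hypothetical and the field layout should be checked against your own export.

```python
def guest_agent_status(instance_view):
    """Return the Guest Agent display status, or 'Unknown' if not reported."""
    agent = instance_view.get("vmAgent") or {}
    statuses = agent.get("statuses") or []
    if not statuses:
        return "Unknown"
    return statuses[0].get("displayStatus", "Unknown")


def power_state(instance_view):
    """Return the part after 'PowerState/' (e.g. 'running'), or 'unknown'."""
    for status in instance_view.get("statuses") or []:
        code = status.get("code", "")
        if code.startswith("PowerState/"):
            return code.split("/", 1)[1]
    return "unknown"


def needs_attention(instance_view):
    """True when the agent is not 'Ready' or the VM is not running."""
    return (guest_agent_status(instance_view) != "Ready"
            or power_state(instance_view) != "running")


if __name__ == "__main__":
    # Hypothetical instance view for a VM whose Guest Agent is not responding.
    sample = {
        "vmAgent": {"statuses": [{"displayStatus": "Not Ready"}]},
        "statuses": [{"code": "PowerState/running"}],
    }
    print(guest_agent_status(sample), power_state(sample), needs_attention(sample))
```

An instance flagged by `needs_attention` is a candidate for the restart or re-image mitigation above.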

Scenario#3:

Check the performance of the virtual machine, such as CPU and memory usage.

If CPU or memory usage is high, the Fabric-related processes may be unable to start their instances, which causes the node to go down.

Mitigation:

Use Task Manager to identify which process is consuming high CPU or memory, then investigate the root cause so that the issue can be fixed permanently.

Collect dumps using one of the tools below to determine the root cause:

DebugDiag:

Download Debug Diagnostic Tool v2 Update 3 from Official Microsoft Download Center

(or) Procdump:

ProcDump - Windows Sysinternals | Microsoft Docs

Restart the virtual machine from the VMSS blade.
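As a scriptable alternative to Task Manager for the first mitigation step, the output of `tasklist /FO CSV /NH` can be ranked by memory usage. The sketch below covers memory only, not CPU. It assumes the default five-column CSV layout and a “Mem Usage” value such as `1,532,340 K`, which can vary by locale; the function names are hypothetical.

```python
import csv
import io


def parse_tasklist_csv(text):
    """Parse `tasklist /FO CSV /NH` output into (image_name, pid, memory_kb).

    Assumption: columns are Image Name, PID, Session Name, Session#, Mem Usage.
    Rows without a numeric PID or memory value (e.g. 'N/A') are skipped.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 5:
            continue
        digits = "".join(ch for ch in fields[4] if ch.isdigit())
        if not digits or not fields[1].isdigit():
            continue
        rows.append((fields[0], int(fields[1]), int(digits)))
    return rows


def top_by_memory(rows, count=5):
    """Return the `count` rows with the highest memory usage, largest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:count]


if __name__ == "__main__":
    # Hypothetical sample; on the node, capture the real output instead, e.g.
    # subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout
    sample = (
        '"Fabric.exe","4312","Services","0","1,532,340 K"\n'
        '"FabricHost.exe","2100","Services","0","24,112 K"\n'
    )
    for name, pid, memory_kb in top_by_memory(parse_tasklist_csv(sample)):
        print(name, pid, memory_kb)
```

The PID of the top consumer can then be given to one of the dump tools above, for example `procdump -ma <PID>` for a full memory dump.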

Scenario#4:

Check the disk usage of the virtual machine; running out of disk space can cause the node to go down.

For disk space related issues, we recommend using the ‘windirstat’ tool mentioned in the article: https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Cluster/Out%20of%20Diskspace.md to understand which folders are consuming the most space.
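Where installing a GUI tool on the node is not practical, a rough folder breakdown can be approximated with a short script. The sketch below is not a replacement for windirstat; it only sums file sizes under each immediate subfolder of a root you choose. The function names are hypothetical, and inaccessible files are silently skipped, so totals may be understated.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction of free space on the disk that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def folder_sizes(root):
    """Return [(subfolder_path, total_bytes)] for each immediate subfolder
    of `root`, sorted with the largest folder first."""
    with os.scandir(root) as entries:
        subfolders = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    sizes = []
    for folder in subfolders:
        total = 0
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file vanished or is not accessible; skip it
        sizes.append((folder, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    root = os.getcwd()  # replace with the drive or folder you want to inspect
    print(f"free: {free_fraction(root):.1%}")
    for folder, size in folder_sizes(root)[:10]:
        print(f"{size / 1024 ** 2:10.1f} MB  {folder}")
```

Running it first on the drive root and then on the largest folder it reports narrows down where the space went.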

Mitigation:

Free up disk space to bring the node back up.

Updated May 05, 2021

Version 1.0

Azure Service Fabric
