Troubleshooting Node down Scenarios in Azure Service Fabric - Part II

reshmav

Microsoft

May 21, 2021

This post continues Part I of Troubleshooting Node down Scenarios in Azure Service Fabric.

Scenario#5:

The virtual machine associated with the node is healthy, but an unhealthy Service Fabric extension can cause the node to go down in the Service Fabric cluster.

Analysis:

RDP into the node that is down. Open Task Manager and observe the Fabric processes.

If Fabric.exe and FabricHost.exe are crashing and restarting often, check Mitigation#1.

If ServiceFabricNodeBootStrapAgent.exe is crashing and restarting often, check Mitigation#2.

If FabricInstallerSvc.exe is crashing and restarting often, check Mitigation#3.
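
The check above can also be approximated with a script instead of watching Task Manager. The following is a minimal sketch, not an official tool: it assumes Python is available on the node, samples the Windows `tasklist /FO CSV /NH` output a few times, and treats a changing PID for a process as a sign that the process is restarting. The function names, the sampling interval, and the threshold of two PID changes are all assumptions made for illustration.

```python
import csv
import io
import subprocess
import time

# Process name -> mitigation, taken from the analysis steps above.
MITIGATIONS = {
    "Fabric.exe": "Mitigation#1",
    "FabricHost.exe": "Mitigation#1",
    "ServiceFabricNodeBootStrapAgent.exe": "Mitigation#2",
    "FabricInstallerSvc.exe": "Mitigation#3",
}


def parse_tasklist_csv(text):
    """Map each watched image name to the set of PIDs found in tasklist CSV output."""
    wanted = {name.lower(): name for name in MITIGATIONS}
    pids = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 2 and row[0].lower() in wanted:
            pids.setdefault(wanted[row[0].lower()], set()).add(int(row[1]))
    return pids


def count_pid_changes(samples):
    """Count, per process, how often its PID set differs between consecutive samples."""
    changes = {name: 0 for name in MITIGATIONS}
    for before, after in zip(samples, samples[1:]):
        for name in MITIGATIONS:
            if before.get(name, set()) != after.get(name, set()):
                changes[name] += 1
    return changes


def suggest_mitigations(changes, threshold=2):
    """Return the mitigations whose processes changed PID at least `threshold` times.

    The threshold is an arbitrary assumption, not a documented value.
    """
    return sorted({MITIGATIONS[name] for name, n in changes.items() if n >= threshold})


def sample_processes(count=6, interval_seconds=10):
    """Collect `count` process snapshots on the node (Windows only)."""
    samples = []
    for index in range(count):
        result = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        )
        samples.append(parse_tasklist_csv(result.stdout))
        if index < count - 1:
            time.sleep(interval_seconds)
    return samples
```

On the node, `suggest_mitigations(count_pid_changes(sample_processes()))` would return the list of mitigations worth checking first. A process that is missing in one sample and present in the next also counts as a change.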

Mitigation#1:

Check whether the ClusterManifest.current.xml file at <Path> (given below) matches the manifest for the cluster, by comparing it with the one shown in Service Fabric Explorer (SFX).

If it does not match:
- Check whether SFX indicates that an upgrade is in progress.

If no upgrade is in progress:
- Go to <Path>.
- Open ClusterManifest.current.xml.
- Replace the contents of ClusterManifest.current.xml with the contents of the manifest in SFX.
- Save the file.
- In Task Manager, select Fabric.exe if it is running and click the "End Task" button.
- If Fabric.exe is not running, reboot the VM.
- It will take a few minutes for the node to become healthy.
- If the node does not become healthy, start again from the beginning.

Path: D:\SvcFab\_Nodename_\Fabric\ClusterManifest.current.xml
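
Comparing the two manifests by eye is error-prone, so the comparison step can be sketched in code. This is only an illustration under stated assumptions: the manifest shown in SFX has been copied into a local file by hand, and differences in whitespace, attribute order, and comments are not meaningful. The function names are hypothetical; `xml.etree.ElementTree.canonicalize` is part of the Python standard library (3.8 and later).

```python
import xml.etree.ElementTree as ET


def manifests_match(local_xml, sfx_xml):
    """Compare two manifest strings after XML canonicalization.

    Canonicalization sorts attributes and drops comments; strip_text=True
    also ignores indentation and line-break differences.
    """
    return (ET.canonicalize(local_xml, strip_text=True)
            == ET.canonicalize(sfx_xml, strip_text=True))


def manifest_files_match(local_path, sfx_copy_path):
    """Same comparison for files; sfx_copy_path is a manually saved copy from SFX."""
    return (ET.canonicalize(from_file=local_path, strip_text=True)
            == ET.canonicalize(from_file=sfx_copy_path, strip_text=True))
```

If the function returns False, the steps above for replacing the contents of ClusterManifest.current.xml apply, provided no upgrade is in progress.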

Mitigation#2:

Check whether ServiceFabricNodeBootStrapAgent.exe is listed among the processes in Task Manager.

If “Yes”:
- Wait a while to see whether the node heals itself.
- This process tries to heal the failure at a coarse level by restarting the VM and reinstalling the SF runtime.
- It waits for 15 minutes after each healing attempt before taking the next action.
- Check ServiceFabricNodeBootstrapAgent.InstallLog on the node. Path: C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\<version>\Service\ServiceFabricNodeBootstrapAgent.InstallLog
- If the node does not heal, check the Event Viewer logs for error details.
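
When checking the install log, the most recent entries are usually the interesting ones. Below is a small sketch for printing the last lines of the log. It assumes the log is a text file readable as UTF-8 (undecodable bytes are replaced rather than raising an error); the function name and the default of 50 lines are illustrative choices, and the path, including the `<version>` folder, has to be supplied by the reader.

```python
from collections import deque


def tail_lines(path, count=50):
    """Return the last `count` lines of a text file without loading it all into memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return [line.rstrip("\r\n") for line in deque(handle, maxlen=count)]
```

For example, `print("\n".join(tail_lines(log_path)))` would show the latest entries, where `log_path` is the InstallLog path on the node.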

If “No”:
- Go to the Services tab in Task Manager and click the Open Services link at the bottom.
- Check the startup mode of the bootstrap service and make sure it is Automatic.
- Start the service.
- If it stays running, go to the "Yes" section above.
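
The same service check can be sketched with the Windows `sc` command instead of the Services console. This is a hedged example, not a supported procedure: the helper names are hypothetical, the exact service name of the bootstrap agent is not given in this post and must be looked up on the node, and changing the service configuration needs an elevated prompt.

```python
import re
import subprocess


def parse_start_type(sc_qc_output):
    """Extract the start type (for example AUTO_START) from `sc qc` output."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def parse_state(sc_query_output):
    """Extract the service state (for example RUNNING) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_query_output)
    return match.group(1) if match else None


def run_sc(*args):
    """Run sc with the given arguments and return its text output (Windows only)."""
    return subprocess.run(["sc", *args], capture_output=True, text=True).stdout


def ensure_automatic_and_running(service_name):
    """Set the service to Automatic if needed, start it if stopped, return its state.

    service_name is supplied by the caller; it is not specified in this post.
    """
    if parse_start_type(run_sc("qc", service_name)) != "AUTO_START":
        run_sc("config", service_name, "start=", "auto")
    if parse_state(run_sc("query", service_name)) != "RUNNING":
        run_sc("start", service_name)
    return parse_state(run_sc("query", service_name))
```

The returned state may be START_PENDING right after the start request, so it is worth querying again after a short wait to confirm that the service stays running.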

Mitigation#3:

Check whether network connectivity of the node is working.
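
One simple way to test connectivity from the node is a TCP connection attempt. The sketch below is generic: this post does not say which endpoints the node must reach, so the host and port are left as parameters, and the function name and timeout are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A False result only shows that this one endpoint is unreachable from the node; it does not identify whether the cause is DNS, a firewall rule, or the remote service.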

For more details, refer to Part III - Troubleshooting Node down Scenarios.

Updated May 21, 2021

Version 2.0

Azure PaaS Blog