Troubleshooting Node down Scenarios in Azure Service Fabric - Part II

Published May 20 2021 11:11 PM
Microsoft

This post continues the earlier article, Troubleshooting Node down Scenarios in Azure Service Fabric.

 

Scenario#5:

The virtual machine associated with the node is healthy, but an unhealthy Service Fabric extension can still cause the node to go down in the Service Fabric cluster.

Analysis:

RDP into the node that is down. Open Task Manager and observe the Fabric processes.


If Fabric.exe and FabricHost.exe are crashing and restarting often, check Mitigation#1.

If ServiceFabricNodeBootstrapAgent.exe is crashing and restarting often, check Mitigation#2.

If FabricInstallerSvc.exe is crashing and restarting often, check Mitigation#3.
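
The triage above is a simple lookup from the crashing process to the mitigation to try. A minimal sketch of that lookup follows. The function name `pick_mitigations` is hypothetical, and the list of crashing processes is assumed to come from what you observed in Task Manager.

```python
# Map each process named in the analysis above to its mitigation.
MITIGATION_BY_PROCESS = {
    "Fabric.exe": "Mitigation#1",
    "FabricHost.exe": "Mitigation#1",
    "ServiceFabricNodeBootstrapAgent.exe": "Mitigation#2",
    "FabricInstallerSvc.exe": "Mitigation#3",
}


def pick_mitigations(crashing_processes):
    """Return the mitigations to check, sorted and without duplicates.

    Process names are matched case-insensitively, as on Windows.
    Names that are not in the table are ignored.
    """
    lookup = {name.lower(): m for name, m in MITIGATION_BY_PROCESS.items()}
    found = set()
    for name in crashing_processes:
        mitigation = lookup.get(name.lower())
        if mitigation is not None:
            found.add(mitigation)
    return sorted(found)
```

For example, `pick_mitigations(["Fabric.exe", "FabricHost.exe"])` points to Mitigation#1 only.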

 

Mitigation#1:

  • Open ClusterManifest.current.xml on the node (see Path below).
  • Does it match the cluster manifest shown in SFX (Service Fabric Explorer)?
  • If it does not match:
    • Does SFX indicate an upgrade in progress?
  • If no upgrade is in progress:
    • Go to <Path>.
    • Open ClusterManifest.current.xml.
    • Replace the contents of ClusterManifest.current.xml with the contents of the manifest in SFX.
    • Save the file.
    • In Task Manager, select Fabric.exe if it is running and click the "End Task" button.
    • If Fabric.exe is not running, reboot the VM.
    • It will take a few minutes for the node to become healthy.
    • If the node does not become healthy, start again from the beginning.

Path: D:\SvcFab\_Nodename_\Fabric\ClusterManifest.current.xml
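
Comparing the two manifests by eye is error-prone, because whitespace and attribute order can differ without the content differing. A minimal sketch of a comparison follows, using the C14N 2.0 canonicalization in Python's standard library (Python 3.8+). It assumes you have copied the manifest text from SFX into a string or file yourself. `manifests_match` is a hypothetical helper name, not part of any Service Fabric tooling.

```python
import xml.etree.ElementTree as ET


def manifests_match(local_xml: str, sfx_xml: str) -> bool:
    """Return True when both manifests are equal after canonicalization.

    Canonicalization sorts attributes and, with strip_text=True,
    drops leading and trailing whitespace in text content, so only
    real content differences make the comparison fail.
    """
    canonical_local = ET.canonicalize(xml_data=local_xml, strip_text=True)
    canonical_sfx = ET.canonicalize(xml_data=sfx_xml, strip_text=True)
    return canonical_local == canonical_sfx
```

If the function returns False and no upgrade is in progress, continue with the replacement steps above.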

 

Mitigation#2:

Check whether ServiceFabricNodeBootstrapAgent.exe appears in the list of processes in Task Manager.

  • If “Yes”:
    • Wait a while to see if the node heals itself.
    • This process tries to heal the failure at a coarse level by restarting the VM and reinstalling the SF runtime.
    • It waits for 15 minutes after an attempt to heal before taking the next action.
    • Check ServiceFabricNodeBootstrapAgent.InstallLog (see “From the Node”).
      Path: C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\<version>\Service\ServiceFabricNodeBootstrapAgent.InstallLog
    • If the node did not heal, go to the “Event Viewer logs” for error details.

 

  • If “No”:
    • Go to the Services tab in Task Manager and click the "Open Services" link at the bottom.
    • Check the startup mode for the bootstrap service and make sure it is Automatic.
    • Start the service.
    • If it stays running, go to the “Yes” section above.
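
For the “Yes” branch, the install log sits under a versioned folder, so it can help to locate the newest copy and read its last lines. A minimal sketch follows, with the `<version>` segment of the path replaced by a wildcard. The helper name `tail_newest_log` is hypothetical, and the UTF-8 encoding is an assumption, since the log format is not described here.

```python
import glob
import os

# Path from the article, with <version> replaced by a wildcard.
INSTALL_LOG_PATTERN = (
    r"C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode"
    r"\*\Service\ServiceFabricNodeBootstrapAgent.InstallLog"
)


def tail_newest_log(pattern=INSTALL_LOG_PATTERN, line_count=20):
    """Return the last `line_count` lines of the most recently modified match.

    Returns an empty list when no file matches the pattern.
    """
    matches = glob.glob(pattern)
    if not matches:
        return []
    newest = max(matches, key=os.path.getmtime)
    # Encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(newest, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()[-line_count:]
```

Run on the node itself, `print("\n".join(tail_newest_log()))` shows the most recent entries to read alongside the Event Viewer logs.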

 

Mitigation#3:

Check whether the node's network connectivity is working.
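
This post does not say which endpoints to test. As a generic starting point, a TCP reachability probe can be sketched as below. The host and port you pass in are your own assumptions about what the node must reach, and `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Any socket-level failure (refused, timed out, unreachable,
    name resolution error) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

A False result only shows that the TCP handshake failed from where the script ran. It does not identify the cause.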

For more details, refer to Part III - Troubleshooting Node down Scenarios.

Last update: May 20 2021 11:15 PM