This is a continuation of the Troubleshooting Node down Scenarios in Azure Service Fabric series.
Scenario#5:
The virtual machine associated with the node is healthy, but an unhealthy Service Fabric extension can still cause the node to go down in the Service Fabric cluster.
Analysis:
RDP into the node that is down. Open Task Manager and observe the Fabric processes.
If Fabric.exe and FabricHost.exe are crashing and restarting often, see Mitigation#1.
If ServiceFabricNodeBootstrapAgent.exe is crashing and restarting often, see Mitigation#2.
If FabricInstallerSvc.exe is crashing and restarting often, see Mitigation#3.
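Watching Task Manager by eye can miss a quick restart. As a rough sketch of the same check, the snippet below compares two snapshots of the process list and reports which of the processes named above changed PID, which usually indicates a restart. It assumes Python is available on the node and that `tasklist /FO CSV /NH` is used to list processes; the function names are illustrative only.

```python
import csv
import io
import subprocess

# Process image names taken from the analysis steps above.
WATCHED = ("Fabric.exe", "FabricHost.exe",
           "ServiceFabricNodeBootstrapAgent.exe", "FabricInstallerSvc.exe")


def parse_tasklist_csv(output):
    """Map image name -> set of PIDs from `tasklist /FO CSV /NH` output."""
    pids = {}
    for row in csv.reader(io.StringIO(output)):
        # Columns: image name, PID, session name, session number, memory usage.
        if len(row) >= 2 and row[1].isdigit():
            pids.setdefault(row[0], set()).add(int(row[1]))
    return pids


def restarted(before, after, names=WATCHED):
    """Return the watched names whose PID set differs between two snapshots."""
    return [n for n in names if before.get(n, set()) != after.get(n, set())]


def snapshot():
    """Take one snapshot of the process list (Windows only)."""
    out = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                         capture_output=True, text=True, check=True).stdout
    return parse_tasklist_csv(out)
```

On the node, one would call `snapshot()` twice, a minute or so apart, and pass both results to `restarted`. A process that is absent in both snapshots is not reported, so an empty result does not by itself prove the process is healthy.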
Mitigation#1:
- Open <Path> (ClusterManifest.current.xml).
- Does it match the manifest for the cluster? (Compare it with the one shown in SFX.)
  - If No:
    - Does SFX indicate an upgrade in progress?
      - If no upgrade is in progress:
        - Go to <Path>.
        - Open ClusterManifest.current.xml.
        - Replace the contents of ClusterManifest.current.xml with the contents of the manifest in SFX.
        - Save.
        - In Task Manager, select Fabric.exe if it is running and click the "End Task" button.
        - If Fabric.exe is not running, reboot the VM.
        - It will take a few minutes for the node to become healthy.
        - If the node does not become healthy, start again from the beginning.
Path: D:\SvcFab\_Nodename_\Fabric\ClusterManifest.current.xml
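Comparing two manifests by eye is error-prone because formatting and attribute order can differ even when the content is the same. The following is a minimal sketch of that comparison, assuming the manifest shown in SFX has been copied into a local file by hand. It uses XML canonicalization from the Python standard library (Python 3.8 or later); the function names and the second file path are hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def manifests_match(local_xml, sfx_xml):
    """Compare two manifest documents given as strings.

    Canonicalization normalizes attribute order and empty-element syntax;
    strip_text=True also ignores whitespace-only formatting differences.
    """
    return (canonicalize(local_xml, strip_text=True)
            == canonicalize(sfx_xml, strip_text=True))


def manifest_files_match(local_path, sfx_copy_path):
    """Same comparison for two files on disk.

    local_path would be the ClusterManifest.current.xml path given above;
    sfx_copy_path is a hypothetical file holding the manifest copied from SFX.
    """
    return (canonicalize(from_file=local_path, strip_text=True)
            == canonicalize(from_file=sfx_copy_path, strip_text=True))
```

A result of `False` corresponds to the "No" branch above. The sketch only answers whether the two documents differ; it does not show where.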
Mitigation#2:
Check whether ServiceFabricNodeBootstrapAgent.exe is listed among the processes in Task Manager.
- If "Yes":
  - Wait a while to see whether the node heals itself.
  - This process tries to heal the failure at a coarse level by restarting the VM and reinstalling the SF runtime.
  - It waits 15 minutes after each heal attempt before taking the next action.
  - Check ServiceFabricNodeBootstrapAgent.InstallLog. "From the Node" path: C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\<version>\Service\ServiceFabricNodeBootstrapAgent.InstallLog
  - If the node did not heal, go to the Event Viewer logs for error details.
- If "No":
  - Go to the Services tab in Task Manager and click the Open Services link at the bottom.
  - Check the startup mode of the bootstrap service and make sure it is Automatic.
  - Start the service.
  - If it stays running, go to the "Yes" section above.
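The install log mentioned above can be long, so a quick filter helps to find the relevant lines. The snippet below is only a sketch: the log format is not described here, so it assumes that failure lines contain words such as "error", "exception" or "fail", and that the file can be read as UTF-8. Both assumptions may need adjusting for the real log.

```python
import re

# Assumption: failure lines contain one of these words; the real log may differ.
PATTERN = re.compile(r"error|exception|fail", re.IGNORECASE)


def suspicious_lines(lines, last=20):
    """Return up to `last` most recent matching lines as (line_number, text)."""
    hits = [(i, line.rstrip("\n"))
            for i, line in enumerate(lines, start=1)
            if PATTERN.search(line)]
    return hits[-last:]


def scan_install_log(path, last=20):
    """Scan a log file; undecodable bytes are replaced rather than raising."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return suspicious_lines(f, last)
```

Calling `scan_install_log` with the InstallLog path for the installed extension version returns the most recent matching lines, which can then be compared against the Event Viewer entries.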
Mitigation#3:
Check whether the node has working network connectivity.
For more details, refer to Part III - Troubleshooting Node down Scenarios.
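A basic connectivity check from the node can be sketched as a plain TCP connection attempt. Which endpoints the node must reach is covered in Part III and is not repeated here, so `host` and `port` below are placeholders to be filled in from there; the function name is illustrative.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

A `False` result only shows that the TCP connection could not be established from this node; it does not say whether the cause is DNS, a firewall or network security rule, or the remote endpoint itself.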