Troubleshooting Node down Scenarios in Azure Service Fabric - Part II

Published 05-20-2021 11:11 PM
Microsoft

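This is a continuation of Troubleshooting Node down Scenarios in Azure Service Fabric, available here: https://techcommunity.microsoft.com/t5/azure-paas-blog/troubleshooting-node-down-scenarios-in-azure-service-fabric-part/ba-p/2324973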

 

Scenario#5:

The virtual machine associated with the node is healthy, but an unhealthy Service Fabric extension can cause the node to go down in the Service Fabric cluster.

Analysis:

RDP into the node that is down. Open Task Manager and observe the Fabric processes.


If Fabric.exe and FabricHost.exe are crashing and restarting often, check Mitigation#1.

If ServiceFabricNodeBootstrapAgent.exe is crashing and restarting often, check Mitigation#2.

If FabricInstallerSvc.exe is crashing and restarting often, check Mitigation#3.
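
The triage above can be sketched in Python. This is only an illustration, not part of the original procedure: it assumes that a process whose PID changes between two observations in Task Manager has restarted in between, and the names restarted_processes and pick_mitigations are hypothetical.

```python
# Hypothetical helper names; the process-to-mitigation mapping follows the text above.
MITIGATION_FOR_PROCESS = {
    "fabric.exe": "Mitigation#1",
    "fabrichost.exe": "Mitigation#1",
    "servicefabricnodebootstrapagent.exe": "Mitigation#2",
    "fabricinstallersvc.exe": "Mitigation#3",
}


def restarted_processes(snapshots):
    """Return lower-cased names whose PID changed between consecutive snapshots.

    Each snapshot is a dict of {process name: PID}, for example noted down
    from Task Manager a minute apart. Assumption: a changed or missing PID
    means the process crashed or restarted in between.
    """
    restarted = set()
    for before, after in zip(snapshots, snapshots[1:]):
        for name, pid in before.items():
            if after.get(name) != pid:
                restarted.add(name.lower())
    return restarted


def pick_mitigations(restarted):
    """Map restarting process names to the mitigations named in this article."""
    return sorted({
        MITIGATION_FOR_PROCESS[name.lower()]
        for name in restarted
        if name.lower() in MITIGATION_FOR_PROCESS
    })


if __name__ == "__main__":
    observed = [
        {"Fabric.exe": 4120, "FabricHost.exe": 988},
        {"Fabric.exe": 5316, "FabricHost.exe": 988},
    ]
    print(pick_mitigations(restarted_processes(observed)))
```

The PID values in the demo are made up; on a real node they would come from the Details tab in Task Manager.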

 

Mitigation#1:

  • Locate ClusterManifest.current.xml at <Path> (listed below).
  • Does it match the manifest for the cluster? Compare it with the one shown in Service Fabric Explorer (SFX).
  • If it does not match:
    • Does SFX indicate an upgrade in progress?
  • If no upgrade is in progress:
    • Go to <Path>.
    • Open ClusterManifest.current.xml.
    • Replace the contents of ClusterManifest.current.xml with the contents of the manifest in SFX.
    • Save the file.
    • In Task Manager, select Fabric.exe if it is running and click the "End Task" button.
    • If Fabric.exe is not running, reboot the VM.
    • It will take a few minutes for the node to become healthy.
    • If the node does not become healthy, start again from the beginning.

Path: D:\SvcFab\_Nodename_\Fabric\ClusterManifest.current.xml
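
Comparing the manifest on the node with the one shown in SFX by eye is error-prone, so the comparison can be sketched in Python. This is a minimal sketch under assumptions: both manifests are available as XML text (for example, the file at the path above and a copy saved from SFX), Python 3.8 or later is used, and the function name manifests_match is hypothetical.

```python
import xml.etree.ElementTree as ET


def manifests_match(local_xml: str, sfx_xml: str) -> bool:
    """Compare two manifests after XML canonicalization (C14N 2.0).

    Canonicalization sorts attributes and drops comments; strip_text=True
    also trims whitespace around text content, so differences that are
    purely formatting do not count as a mismatch.
    """
    canon_local = ET.canonicalize(xml_data=local_xml, strip_text=True)
    canon_sfx = ET.canonicalize(xml_data=sfx_xml, strip_text=True)
    return canon_local == canon_sfx
```

To use it, read both files into strings and pass them to manifests_match; a False result corresponds to the "does not match" branch of the steps above.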

 

Mitigation#2:

Check whether ServiceFabricNodeBootstrapAgent.exe is listed among the processes in Task Manager.

  • If “Yes”:
    • Wait a while to see whether the node heals itself.
    • This process tries to heal the failure at a coarse level by restarting the VM and reinstalling the SF runtime.
    • It waits 15 minutes after each heal attempt before taking the next action.
    • Check ServiceFabricNodeBootstrapAgent.InstallLog (check “From the Node”).
      Path: C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\<version>\Service\ServiceFabricNodeBootstrapAgent.InstallLog
    • If the node did not heal, go to “Event Viewer logs” for error details.
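
Because the <version> folder in the InstallLog path differs from node to node, locating the log can be sketched in Python with a wildcard. This is a hedged sketch: the helper names find_install_logs and tail are hypothetical, and the log's text encoding is not stated in this article, so undecodable bytes are simply replaced.

```python
import glob
import os

# Path from the article; "*" stands in for the <version> folder.
DEFAULT_PLUGIN_ROOT = r"C:\Packages\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode"
LOG_NAME = "ServiceFabricNodeBootstrapAgent.InstallLog"


def find_install_logs(plugin_root=DEFAULT_PLUGIN_ROOT):
    """Return InstallLog paths for every extension version folder, oldest file first."""
    pattern = os.path.join(plugin_root, "*", "Service", LOG_NAME)
    return sorted(glob.glob(pattern), key=os.path.getmtime)


def tail(path, line_count=40):
    """Return the last line_count lines of a text file (encoding is an assumption)."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()[-line_count:]
```

On the node, tail(find_install_logs()[-1]) would return the end of the most recently written log, which is usually where the latest heal attempt is recorded.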

 

  • If “No”:
    • Go to the Services tab in Task Manager and click the Open Services link at the bottom.
    • Check the startup type of the bootstrap service and make sure it is set to Automatic.
    • Start the service.
    • If it stays running, go to the "Yes" section above.
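
The startup check can also be done from a script by reading the output of the Windows sc.exe qc command. The sketch below is an assumption-laden illustration: the service name ServiceFabricNodeBootstrapAgent is assumed (confirm the actual name in the Services console), and parse_start_type and query_start_type are hypothetical helper names.

```python
import re
import subprocess

# Assumption: the bootstrap service is registered under this name.
ASSUMED_SERVICE_NAME = "ServiceFabricNodeBootstrapAgent"


def parse_start_type(sc_qc_output):
    """Extract the START_TYPE label (e.g. AUTO_START, DEMAND_START) from `sc qc` output."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


def query_start_type(service_name=ASSUMED_SERVICE_NAME):
    """Run `sc.exe qc <service>` (Windows only) and return its start type label."""
    result = subprocess.run(
        ["sc.exe", "qc", service_name], capture_output=True, text=True
    )
    return parse_start_type(result.stdout)
```

A result of AUTO_START corresponds to the Automatic startup type; any other value, or None when the service is not found, means the steps above still need to be done by hand.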

 

Mitigation#3:

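Check whether the node's network connectivity is working.

For more details, refer to Part III - Troubleshooting Node down Scenarios: https://techcommunity.microsoft.com/t5/azure-paas-blog/troubleshooting-node-down-scenarios-in-azure-service-fabric-part/ba-p/2374341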
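
As a rough first check, basic TCP reachability from the node can be sketched in Python. This article does not say which endpoints or ports matter (Part III covers the details), so the host and port are placeholders to be supplied by the reader, and can_connect is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A False result only shows that this one endpoint was unreachable over TCP; it does not identify the cause.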

Last update:
‎May 20 2021 11:15 PM