Troubleshooting Node down Scenarios in Azure Service Fabric - Part III

Published May 20 2021 11:12 PM 1,231 Views
Microsoft

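This is a continuation of Troubleshooting Node down Scenarios in Azure Service Fabric, available here: https://techcommunity.microsoft.com/t5/azure-paas-blog/troubleshooting-node-down-scenarios-in-azure-service-fabric-part/ba-p/2374508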

 

Scenario#6:

Check the network connectivity between the nodes:

  • Open a command prompt on one of the nodes.
  • Run: ping <IP Address Of Other Node>

[Screenshot: ping output]

If the request times out, connectivity between the two nodes is likely being blocked.

Mitigation:

Check whether a network security group (NSG) rule is blocking connectivity between the nodes.
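
Because ping only exercises ICMP, a TCP probe against the ports the cluster uses can help narrow down which NSG rule is involved. The following is a minimal sketch, not part of the original troubleshooting steps. The port numbers are assumed Azure Service Fabric defaults and should be confirmed against your own cluster configuration; the function names are hypothetical.

```python
import socket

# Assumed default Service Fabric ports; confirm against your cluster manifest.
ASSUMED_SF_PORTS = {
    1025: "cluster connection",
    1026: "lease driver",
    19000: "client connection",
    19080: "HTTP gateway",
}


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


def probe_node(host: str, ports=ASSUMED_SF_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to True (reachable) or False (blocked or closed)."""
    return {port: tcp_reachable(host, port, timeout) for port in ports}
```

Calling probe_node with the IP address of the other node, run from the node that cannot reach it, returns a dictionary of port to reachability. A port that times out while others respond points to a rule scoped to that port rather than a general outage.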

 

Scenario#7:

A node-to-node communication failure caused by any of the following can bring a node down.

  • The cluster certificate has expired.
  • The SF extension on the VMSS resource points to an expired certificate. In that case the node may go down when the VM reboots. Check the thumbprint under "certificate" in the extension settings:

"extensionProfile": {
    "extensions": [
        {
            "properties": {
                "autoUpgradeMinorVersion": true,
                "settings": {
                    "clusterEndpoint": "https://xxxxx.servicefabric.azure.com/runtime/clusters/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                    "nodeTypeRef": "sys",
                    "dataPath": "D:\\\\SvcFab",
                    "durabilityLevel": "Bronze",
                    "enableParallelJobs": true,
                    "nicPrefixOverride": "10.0.0.0/24",
                    "certificate": {
                        "thumbprint": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                        "x509StoreName": "My"
                    }

 

  • The certificate is not ACL'd to the Network Service account. Make sure it is.
  • The Reverse Proxy certificate has expired.
  • If all of the above have been taken care of, go to Scenario#8.
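
The certificate checks above can be partly scripted. The sketch below is illustrative only. It assumes the VMSS "extensionProfile" section has already been loaded into a Python dictionary, and that the certificate's expiry date has been read from the certificate store and written in the "Mon DD HH:MM:SS YYYY GMT" form that Python's ssl.cert_time_to_seconds accepts. Both function names are hypothetical.

```python
import ssl
import time
from typing import List, Optional


def extension_thumbprints(extension_profile: dict) -> List[str]:
    """Collect the certificate thumbprints referenced by VMSS extensions."""
    found = []
    for ext in extension_profile.get("extensions", []):
        props = ext.get("properties") or {}
        settings = props.get("settings") or {}
        cert = settings.get("certificate") or {}
        thumbprint = cert.get("thumbprint")
        if thumbprint:
            # Normalise case so thumbprints can be compared directly.
            found.append(thumbprint.upper())
    return found


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before not_after; zero or negative means expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0
```

Comparing the result of extension_thumbprints with the thumbprint of the current cluster certificate shows whether the extension still points to an old certificate. A zero or negative value from days_until_expiry means the certificate has expired.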

 

Scenario#8:

If Node1 is not able to establish a lease with a neighboring node (Node2), Node1 can go down.

From the SF traces:

For example, in the logs we see that the node with Node ID "e4eac25286f23859b79b5483964ab0c8" (Node1) failed to establish a lease with the node with Node ID "c196867202638ea43655614031736e9" (Node2):

[Screenshot: SF trace showing the failed lease between Node1 and Node2]

The focus should now be on the node with which the lease is failing (Node2), rather than on the node that is down (Node1).

[Screenshot: SF trace containing the error code]

From the traces above, we get the error code c0000017.

To understand what this error code means, download the Microsoft Error Lookup Tool (https://www.microsoft.com/en-us/download/details.aspx?id=100432).

Then run the executable, passing the error code as a parameter:

[Screenshot: Error Lookup Tool output for c0000017]
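
As a complement to the lookup tool, the sketch below splits such a code into its fields, assuming the value is an NTSTATUS code (top two bits are severity, bits 16 to 27 the facility, low 16 bits the code). The small name table is illustrative and far from complete, and the function name is hypothetical.

```python
# A few well-known NTSTATUS values; this table is illustrative, not complete.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000017: "STATUS_NO_MEMORY",
    0xC0000022: "STATUS_ACCESS_DENIED",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
}

SEVERITY = {0: "SUCCESS", 1: "INFORMATIONAL", 2: "WARNING", 3: "ERROR"}


def decode_ntstatus(code: str) -> dict:
    """Split a hex NTSTATUS string such as 'c0000017' into its fields."""
    value = int(code, 16) & 0xFFFFFFFF
    return {
        "value": f"0x{value:08X}",
        "severity": SEVERITY[value >> 30],    # top two bits
        "facility": (value >> 16) & 0x0FFF,   # bits 16 to 27
        "code": value & 0xFFFF,               # low 16 bits
        "name": KNOWN_NTSTATUS.get(value, "unknown"),
    }
```

For the code from the traces, decode_ntstatus("c0000017") reports severity ERROR and the name STATUS_NO_MEMORY. Codes missing from the table come back as "unknown" and still need the Error Lookup Tool.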

 

Mitigation:

Error code c0000017 corresponds to STATUS_NO_MEMORY, meaning that not enough virtual memory is available. Restart Node2 to free up virtual memory so that it can re-establish the lease with Node1 and bring Node1 back up.

 

 

Last update: May 20 2021 11:19 PM