Troubleshooting Node down Scenarios in Azure Service Fabric - Part III
Published May 20 2021 11:12 PM
Microsoft

This is a continuation of the Troubleshooting Node down Scenarios in Azure Service Fabric series.

 

Scenario#6:

Check the Network connectivity between the nodes:

  • Open a command prompt.
  • Run: ping <IP Address Of Other Node>

[Screenshot: ping output]

If the request times out, the nodes cannot reach each other over the network.

Mitigation:

Check whether a Network Security Group (NSG) rule is blocking connectivity between the nodes.
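
Because ping uses ICMP, it can time out even when TCP traffic between the nodes is allowed, and the reverse is also possible. As a complementary check, here is a minimal Python sketch that tests whether a TCP connection to another node can be opened. The function name `can_connect` is hypothetical, and the host and port you pass in are assumptions you must replace with the other node's IP address and a port your cluster actually uses for node-to-node traffic.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("10.0.0.5", 1025)` would report whether that address and port are reachable from the current node; both values here are placeholders. A False result from one node but not from another in the same subnet is a hint to look at NSG rules or the local firewall.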

 

Scenario#7:

A node-to-node communication failure caused by any of the reasons below can lead to a node going down.

  • The cluster certificate has expired.
  • The SF extension on the VMSS resource points to an expired certificate. In this case the node may go down when the VM reboots.

"extensionProfile": {
    "extensions": [
        {
            "properties": {
                "autoUpgradeMinorVersion": true,
                "settings": {
                    "clusterEndpoint": "https://xxxxx.servicefabric.azure.com/runtime/clusters/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                    "nodeTypeRef": "sys",
                    "dataPath": "D:\\\\SvcFab",
                    "durabilityLevel": "Bronze",
                    "enableParallelJobs": true,
                    "nicPrefixOverride": "10.0.0.0/24",
                    "certificate": {
                        "thumbprint": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                        "x509StoreName": "My"
                    }

 

  • Make sure the certificate is ACL'd to the NETWORK SERVICE account.
  • The Reverse Proxy certificate has expired.
  • If all of the above have been ruled out, go to Scenario#8.
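
The certificate checks in this scenario can be partly scripted. The following Python sketch is illustrative only: it assumes you have exported the VMSS definition to JSON and loaded its `extensionProfile` object into a dict, and that you have separately looked up the expiry (NotAfter) date of the certificate with the matching thumbprint. The function names `find_thumbprints` and `days_until_expiry` are hypothetical, and the date string must be in the form that Python's `ssl` module expects, for example `May 20 23:11:00 2022 GMT`.

```python
import ssl
import time


def find_thumbprints(extension_profile: dict) -> list:
    """Collect the certificate thumbprints referenced in each extension's settings."""
    found = []
    for ext in extension_profile.get("extensions", []):
        settings = (ext.get("properties") or {}).get("settings") or {}
        cert = settings.get("certificate")
        if cert and "thumbprint" in cert:
            found.append(cert["thumbprint"])
    return found


def days_until_expiry(not_after: str, now: float = None) -> float:
    """Return the days remaining before `not_after`; negative means already expired."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0
```

If the thumbprint returned by `find_thumbprints` belongs to a certificate for which `days_until_expiry` is negative, the extension is pointing at an expired certificate, which is the condition described above.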

 

Scenario#8:

If Node1 is not able to establish a lease with a neighboring node (Node2), Node1 can go down.

From the SF traces:

For example, the logs show that the node with Node ID “e4eac25286f23859b79b5483964ab0c8” (Node1) failed to establish a lease with the node with Node ID “c196867202638ea43655614031736e9” (Node2):

[Screenshot: SF trace showing the lease failure between Node1 and Node2]

The focus should now be on the node with which the lease is failing (Node2), rather than on the node that is down (Node1).

[Screenshot: SF traces showing the error code]

From the above traces, we get the error code c0000017.

To understand what this error code means, download the Microsoft Error Lookup Tool.

Then run the executable, passing the error code as a parameter:

[Screenshot: Error Lookup Tool output for c0000017]
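
As a rough cross-check of what the lookup tool reports, the code can also be decoded by hand. The sketch below assumes the value is an NTSTATUS code, which the source does not state explicitly, and splits it into the standard NTSTATUS bit fields. The function name `decode_ntstatus` and the one-entry `KNOWN_CODES` table are hypothetical; the table is not a replacement for the lookup tool.

```python
# Minimal table for illustration; 0xC0000017 is STATUS_NO_MEMORY in ntstatus.h.
KNOWN_CODES = {0xC0000017: "STATUS_NO_MEMORY"}

SEVERITY_NAMES = ["success", "informational", "warning", "error"]


def decode_ntstatus(code_hex: str) -> dict:
    """Split a 32-bit NTSTATUS value (given as hex text) into its bit fields."""
    value = int(code_hex, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("NTSTATUS values are 32-bit")
    return {
        "severity": SEVERITY_NAMES[value >> 30],       # bits 31-30
        "customer_defined": bool((value >> 29) & 1),   # bit 29
        "facility": (value >> 16) & 0xFFF,             # bits 27-16
        "code": value & 0xFFFF,                        # bits 15-0
        "name": KNOWN_CODES.get(value, "unknown"),
    }
```

Under that assumption, `decode_ntstatus("c0000017")` yields an error-severity code named STATUS_NO_MEMORY, which is consistent with the virtual memory mitigation that follows.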

 

Mitigation:

Restart Node2. This could free up virtual memory on Node2 and allow it to re-establish the lease with Node1, which brings Node1 back up.

 

 

Last update: May 20 2021 11:19 PM