
Azure PaaS Blog

Troubleshooting Node down Scenarios in Azure Service Fabric - Part III

reshmav
May 21, 2021

This post continues the series Troubleshooting Node down Scenarios in Azure Service Fabric.

 

Scenario#6:

Check the network connectivity between the nodes:

  • Open a command prompt.
  • Run ping <IP Address Of Other Node>.

If the request times out, the nodes cannot reach each other.

Mitigation:

Check whether an NSG (network security group) rule is blocking connectivity between the nodes.
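
When a cluster has many nodes, the ping check above can be scripted. The following is a minimal Python sketch, not an official tool: the function names are hypothetical, and it simply wraps the operating system's ping command. Note that on Windows, ping can exit with code 0 even when the reply is "Destination host unreachable", so treat a positive result as a hint and confirm by reading the output if in doubt.

```python
import platform
import subprocess


def build_ping_command(ip, count=4, system=None):
    """Build the ping command line for the given OS (hypothetical helper)."""
    system = system or platform.system()
    # Windows uses -n for the echo request count, Linux and macOS use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), ip]


def is_reachable(ip, count=4, timeout_s=30):
    """Return True if ping exits with code 0 for the given address."""
    try:
        result = subprocess.run(
            build_ping_command(ip, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Ping hung past the timeout, or no ping binary is on the PATH.
        return False
    return result.returncode == 0


def unreachable_nodes(ips):
    """Return the subset of addresses that did not answer ping."""
    return [ip for ip in ips if not is_reachable(ip)]
```

Calling unreachable_nodes(["10.0.0.5", "10.0.0.6"]) from one node returns the addresses that did not answer; the two addresses are placeholders for the private IPs of the other nodes.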

 

Scenario#7:

A node-to-node communication failure caused by any of the following can lead to a node going down.

  • The cluster certificate has expired.
  • The Service Fabric extension on the VMSS resource points to an expired certificate. In that case the node may go down when the VM reboots. The thumbprint is set under "certificate" in the extension settings, as in this excerpt:

"extensionProfile": {
    "extensions": [
        {
            "properties": {
                "autoUpgradeMinorVersion": true,
                "settings": {
                    "clusterEndpoint": "https://xxxxx.servicefabric.azure.com/runtime/clusters/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                    "nodeTypeRef": "sys",
                    "dataPath": "D:\\\\SvcFab",
                    "durabilityLevel": "Bronze",
                    "enableParallelJobs": true,
                    "nicPrefixOverride": "10.0.0.0/24",
                    "certificate": {
                        "thumbprint": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
                        "x509StoreName": "My"
                    }

 

  • Make sure the certificate is ACL'd to NETWORK SERVICE.
  • The Reverse Proxy certificate has expired.
  • If all of the above are in order, go to Scenario#8.
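
To check which certificate the extension references, the thumbprint can be read out of the VMSS model programmatically. The sketch below is only an illustration under assumptions: it expects a Python dict shaped like the extensionProfile excerpt in this post (extensions, then properties, then settings, then certificate), and the function names are hypothetical. Tools that flatten "properties" in their output would need the lookup path adjusted.

```python
def extension_thumbprints(extension_profile):
    """Collect certificate thumbprints referenced by extension settings."""
    found = []
    for ext in extension_profile.get("extensions", []):
        settings = ext.get("properties", {}).get("settings", {})
        cert = settings.get("certificate")
        if cert and "thumbprint" in cert:
            # Normalize so comparisons ignore case and stray spaces.
            found.append(cert["thumbprint"].replace(" ", "").upper())
    return found


def stale_thumbprints(extension_profile, current_thumbprint):
    """Return referenced thumbprints that differ from the expected one."""
    wanted = current_thumbprint.replace(" ", "").upper()
    return [t for t in extension_thumbprints(extension_profile) if t != wanted]
```

If stale_thumbprints returns a non-empty list, the extension still references a certificate other than the one you expect, which is worth correcting before the next VM reboot.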

 

Scenario#8:

If Node1 cannot establish a lease with a neighboring node (Node2), Node1 can go down.

From the SF traces:

For example, in the logs we see that the node with Node ID “e4eac25286f23859b79b5483964ab0c8” (Node1) failed to establish a lease with the node with Node ID “c196867202638ea43655614031736e9” (Node2).

The focus should now be on the node with which the lease is failing (Node2), rather than on the node that is down (Node1).

From the above traces, we get the error code c0000017.

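To understand what this error code means, download the Microsoft Error Lookup Tool.

Then run the executable, passing the error code as a parameter. For c0000017 the tool reports STATUS_NO_MEMORY, which means that not enough virtual memory or paging file quota is available to complete the operation.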
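
As a quick cross-check when the lookup tool is not at hand, the code can also be split into its NTSTATUS fields by hand. The Python sketch below is an illustration only and does not replace the tool: it decodes the standard NTSTATUS bit layout (severity in the top two bits, facility in bits 16 to 27, code in the low 16 bits), and its name table is a hand-picked subset with just a few well-known values.

```python
# Small, deliberately incomplete lookup table of well-known NTSTATUS values.
KNOWN_NTSTATUS = {
    0xC0000001: "STATUS_UNSUCCESSFUL",
    0xC0000017: "STATUS_NO_MEMORY",
    0xC0000022: "STATUS_ACCESS_DENIED",
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(text):
    """Split a hex NTSTATUS string such as 'c0000017' into its fields."""
    value = int(text, 16) & 0xFFFFFFFF
    return {
        "value": value,
        "severity": SEVERITY[value >> 30],    # bits 31-30
        "facility": (value >> 16) & 0xFFF,    # bits 27-16
        "code": value & 0xFFFF,               # bits 15-0
        "name": KNOWN_NTSTATUS.get(value, "unknown (use the lookup tool)"),
    }
```

For the code from the traces, decode_ntstatus("c0000017") gives severity "error" and the name STATUS_NO_MEMORY, which matches the virtual memory condition addressed by the mitigation.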

 

Mitigation:

Restart Node2. This can free up virtual memory on Node2 so that it can establish the lease with Node1 again, which brings Node1 back up.

 

 

Updated May 21, 2021
Version 2.0