From time to time, a Service Fabric cluster may report an error or warning message stating that one or more specific nodes do not have enough disk space. This can happen for different reasons. This blog talks about the common solutions to this issue.
There are many possible root causes for the "not enough disk space" issue. In this blog, we'll mainly talk about the following four:
To identify which one matches your own scenario, please check the following descriptions:
Now let's talk about the solutions to the above four kinds of issues.
1. To reconfigure the size limit of the diagnostic log files, we need to open a PowerShell window with the Az module installed. Please refer to the official document for how to install it. After logging in successfully, we can use the following command to set the expected size limit.
Set-AzServiceFabricSetting -ResourceGroupName SF-normal -Name sfhttpjerry -Section Diagnostics -Parameter MaxDiskQuotaInMB -Value 25600
Please remember to replace the resource group name, Service Fabric cluster name and the size limit value with your own before running the command.
Once this command runs successfully, it may not take effect immediately. The Service Fabric cluster scans the size of the diagnostic logs periodically, so we need to wait until the next scan is triggered. If at that point the size of the diagnostic log files is bigger than the configured limit (25600 MB = 25 GB in my example), the cluster will automatically delete some log files to release disk space.
2. To change the path of the paging file, we can follow these steps to switch it (a scripted alternative is also sketched after the screenshot).
paging file configuration change
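If you prefer to script the change instead of clicking through the UI, here is a minimal sketch using the Win32_PageFileSetting WMI class. The drive letters and sizes below are assumptions for illustration; adjust them to your own layout, and note that a reboot is required before the new paging file is used.

# Minimal sketch: move the paging file from D: to C: (drive letters and sizes are examples).
# Disable automatic page file management first so the manual settings take effect.
$cs = Get-CimInstance -ClassName Win32_ComputerSystem
$cs | Set-CimInstance -Property @{ AutomaticManagedPagefile = $false }

# Remove the existing paging file entry on D:.
Get-CimInstance -ClassName Win32_PageFileSetting |
    Where-Object { $_.Name -like 'D:*' } |
    Remove-CimInstance

# Create a new paging file on C: with initial/maximum size in MB.
New-CimInstance -ClassName Win32_PageFileSetting -Property @{
    Name        = 'C:\pagefile.sys'
    InitialSize = [UInt32]4096
    MaximumSize = [UInt32]8192
}

# Reboot the node for the change to take effect.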
3. Cleaning up the application packages is easy to do in SFX. Once we open SFX, go to the same Image Store page we used when identifying this root cause. Then on the left side there is a menu to delete the unneeded package.
How to delete package in SFX
After typing the name into the confirmation window and selecting Delete Image Store content, the cluster will automatically delete the unneeded application package on every node.
Confirmation window to delete package
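As an alternative to SFX, the same clean-up can be scripted with the Service Fabric PowerShell module. The sketch below assumes an existing cluster connection (see the Connect-ServiceFabricCluster example later in this post); "MyAppTypePackage" is a placeholder package path, and "fabric:ImageStore" is the default image store connection string for clusters using the image store service.

# List the content of the image store to find the unneeded package path.
Get-ServiceFabricImageStoreContent -ImageStoreConnectionString "fabric:ImageStore"

# Remove the unneeded package from the image store ("MyAppTypePackage" is a placeholder).
Remove-ServiceFabricApplicationPackage -ApplicationPackagePathInImageStore "MyAppTypePackage" -ImageStoreConnectionString "fabric:ImageStore"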
4. For the issue caused by too many images, we can configure the cluster to automatically delete the unused images. The detailed configuration can be found in this document, and the way to update the cluster configuration is as follows:
a. Visit Azure Resource Explorer in Read/Write mode, log in and find the Service Fabric cluster.
b. Click the Edit button and modify the JSON cluster configuration as needed. For this solution, that means adding some configuration into the fabricSettings part, as sketched after these steps.
c. Send the request to save the new configuration by clicking the green PUT button and wait until the provisioning state of the cluster becomes Succeeded.
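For reference, the addition to fabricSettings looks roughly like the following. The exact parameter names and recommended values are in the linked document; the value 3 below is only an example of how many unused versions to keep.

{
    "name": "Management",
    "parameters": [
        {
            "name": "CleanupUnusedApplicationTypes",
            "value": "true"
        },
        {
            "name": "MaxUnusedAppTypeVersionsToKeep",
            "value": "3"
        }
    ]
}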
To make this solution work, there is one more thing we need to do: unregister all unnecessary, unused application type versions. This can also be done with the command documented here. Since the parameters ApplicationTypeName and ApplicationTypeVersion are both required, one run of the command only unregisters one version of one application type. If you have many versions and many application types, a script such as the following can loop through all of them (it assumes an active cluster connection established by Connect-ServiceFabricCluster):
# Get all registered application types and all running applications.
$apptypes = Get-ServiceFabricApplicationType
$apps = Get-ServiceFabricApplication

foreach ($apptype in $apptypes)
{
    # Check whether any running application uses this application type and version.
    $using = $false
    foreach ($app in $apps)
    {
        if ($apptype.ApplicationTypeName -eq $app.ApplicationTypeName -and $apptype.ApplicationTypeVersion -eq $app.ApplicationTypeVersion)
        {
            $using = $true
            break
        }
    }
    # Unregister the application type version if no application is using it.
    if ($using -eq $false)
    {
        Unregister-ServiceFabricApplicationType -ApplicationTypeName $apptype.ApplicationTypeName -ApplicationTypeVersion $apptype.ApplicationTypeVersion -Force
    }
}
In addition to the above four causes and solutions, there are three more possible solutions for the "Not enough disk space" issue, explained below.
Sometimes scaling out, which means increasing the number of nodes in the Service Fabric cluster, will also help mitigate the disk-full issue. This operation not only improves CPU and memory usage, but also lets the cluster re-balance the distribution of services among nodes, which improves disk usage. When using Silver or higher durability, we can scale out by increasing the node count of the VMSS directly, for example as sketched below.
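A minimal sketch of scaling out with Az PowerShell follows; the resource group and scale set names are placeholders, and the same change can also be made from the Azure portal.

# Minimal sketch: increase the instance count of the VMSS behind a node type.
# "SF-normal" and "nodetype1" are placeholder names; replace them with your own.
$vmss = Get-AzVmss -ResourceGroupName "SF-normal" -VMScaleSetName "nodetype1"
$vmss.Sku.Capacity = 7   # new node count
Update-AzVmss -ResourceGroupName "SF-normal" -VMScaleSetName "nodetype1" -VirtualMachineScaleSet $vmss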
This point is easy to understand. Since the issue is a full disk, we can simply change the VM SKU to a bigger size to get more disk space. But please check all the above solutions first to make sure we really need a bigger disk to handle more data. For example, if our application has stateful services and the disk fills up because those stateful services save too much data, then we should consider improving the code logic first rather than scaling up the VMSS. Otherwise, even with a bigger VM SKU, the issue will reproduce sooner or later.
To scale up the VMSS, there are two possible ways:
Please be aware that the ReplicatorLog does not store any kind of log file. It stores important data for both the Service Fabric cluster and the applications, so deleting this folder can cause data loss. The size of this folder is fixed to the configured size, 8 GB by default, and it always stays at that size no matter how much data is actually saved.
It's NOT recommended to modify this setting. You should only do it if you absolutely have to, because it carries a risk of data loss.
To change the ReplicatorLog size, as mentioned above, the key point is to add a customized KtlLogger setting to the Service Fabric cluster. To do that, we need to:
a. Visit Azure Resource Explorer in Read/Write mode, log in and find the Service Fabric cluster.
b. Add the KtlLogger setting into the fabricSettings part. The expected expression will be like the following:
{
    "name": "KtlLogger",
    "parameters": [
        {
            "name": "SharedLogSizeInMB",
            "value": "4096"
        }
    ]
}
c. Send the request to save the new configuration by clicking the green PUT button and wait until the provisioning state of the cluster becomes Succeeded.
d. Visit SFX and check the status to make sure everything is in a healthy state.
e. Open a PowerShell window on a computer where the cluster certificate is installed. If the Service Fabric PowerShell module is not installed yet, please refer to our document to install it first. Then run the following commands to connect to the Service Fabric cluster. The thumbprint here is the cluster certificate's thumbprint; also remember to replace the cluster name with the correct URL.
$ClusterName = "xxx.australiaeast.cloudapp.azure.com:19000"
$CertThumbprint = "7279972D160AB4C3CBxxxxx34EA2BCFDFAC2B42"
Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 -X509Credential -ServerCertThumbprint $CertThumbprint -FindType FindByThumbprint -FindValue $CertThumbprint -StoreLocation CurrentUser -StoreName My
f. Use the following command to disable one node in the Service Fabric cluster (_nodetype1_0 in the example).
Disable-ServiceFabricNode -NodeName "_nodetype1_0" -Intent RemoveData -Force
g. Monitor in SFX until the node from the last command shows the Disabled status.
disabled node in SFX
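If you prefer to check from PowerShell instead of SFX, a small sketch using the same example node name and the existing cluster connection:

# Poll the node status from PowerShell instead of watching SFX.
Get-ServiceFabricNode -NodeName "_nodetype1_0" | Select-Object NodeName, NodeStatus, HealthState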
h. RDP into this node and manually delete the D:\SvcFab\ReplicatorLog folder. Attention! This operation removes everything in ReplicatorLog. Please double-check whether any content there is still needed before deleting it.
i. Use the following command to enable the disabled node, then monitor until the node shows the Up status.
Enable-ServiceFabricNode -NodeName "_nodetype1_0"
j. Wait until everything is healthy in SFX, then repeat steps f to i on every node. After that, the ReplicatorLog folder on each node will have the new customized size.