Azure Service Fabric
Common causes of SSL/TLS connection issues and solutions
In the TLS connection common causes and troubleshooting guide (microsoft.com), the mechanism for establishing SSL/TLS connections and the tools to troubleshoot them were introduced. In this article, I would like to introduce three common issues that may occur when establishing an SSL/TLS connection, together with the corresponding solutions for Windows, Linux, .NET, and Java:

TLS version mismatch
Cipher suite mismatch
TLS certificate is not trusted

TLS version mismatch

Before we jump into solutions, let me introduce how the TLS version is determined. As shown in the dataflow described in the first article (https://techcommunity.microsoft.com/t5/azure-paas-blog/ssl-tls-connection-issue-troubleshooting-guide/ba-p/2108065), a TLS connection is always initiated by the client: the client proposes a TLS version and the server checks whether it supports that version. If the server supports it, the handshake continues; if not, the connection is terminated.

Detection

You can test with the tools introduced in the TLS connection common causes and troubleshooting guide (microsoft.com) to verify whether a TLS connection issue is caused by a TLS version mismatch. If you capture network packets, you can also see the TLS version specified in the Client Hello. If the connection is terminated without a Server Hello, it could be either a TLS version mismatch or a cipher suite mismatch.

Solution

Different types of clients have their own mechanisms to determine the TLS version. For example, web browsers (IE, Edge, Chrome, Firefox) each have their own set of supported TLS versions, applications have their own libraries that define the TLS version, and operating systems such as Windows also allow the TLS version to be defined at the OS level.

Web browser

In the latest Edge and Chrome, TLS 1.0 and TLS 1.1 are deprecated; TLS 1.2 is the default TLS version for these two browsers. Below are the steps to set the TLS version in Internet Explorer and Firefox; they work on Windows 10.

Internet Explorer
Search for Internet Options.
Find the setting in the Advanced tab.

Firefox
Open Firefox and type about:config in the address bar.
Type tls in the search bar and find the settings security.tls.version.min and security.tls.version.max. The values define the range of supported TLS versions: 1 is TLS 1.0, 2 is TLS 1.1, 3 is TLS 1.2, 4 is TLS 1.3.

Windows system

Different Windows OS versions have different default TLS versions. The default TLS version can be overridden by adding or editing the DWORD registry values 'Enabled' and 'DisabledByDefault'. These registry values are configured separately for the protocol client and server roles under registry subkeys named using the following format:

<SSL/TLS/DTLS> <major version number>.<minor version number><Client\Server>

For example, below is a registry path with a version-specific subkey:

Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.2\Client

For details, please refer to Transport Layer Security (TLS) registry settings | Microsoft Learn.

Application running on the .NET Framework

The application uses the OS-level configuration by default. For a quick test of HTTP requests, you can add the line below to specify the TLS version in your application before the TLS connection is established. To be on the safe side, define it at the beginning of the project.
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

The line above can be used as a quick test to verify the problem, but it is always recommended to follow the document below for best practices: https://docs.microsoft.com/en-us/dotnet/framework/network-programming/tls

Java application

For a Java application that uses Apache HttpClient to communicate with an HTTP server, you can check How to Set TLS Version in Apache HttpClient | Baeldung for how to set the TLS version in code.

Cipher suite mismatch

Like a TLS version mismatch, a cipher suite mismatch can also be tested with the tools introduced in the previous article.

Detection

In the network capture, the connection is terminated after the Client Hello, so if you do not see a Server Hello packet, that indicates either a TLS version mismatch or a cipher suite mismatch. If the server supports public access, you can also test it with SSL Labs (https://www.ssllabs.com/ssltest/analyze.html) to detect all supported cipher suites.

Solution

In the process of establishing an SSL/TLS connection, the server makes the final decision on which cipher suite is used. Different Windows OS versions support different TLS cipher suites and priority orders. For the supported cipher suites, please refer to Cipher Suites in TLS/SSL (Schannel SSP) - Win32 apps | Microsoft Learn for details.

If the service is hosted on Windows, the default order can be overridden with the group policy below to change which cipher suite is chosen. The steps work on Windows Server 2019:

Edit group policy -> Computer Configuration -> Administrative Templates -> Network -> SSL Configuration Settings -> SSL Cipher Suite Order.
Enable it and configure the priority list with all the cipher suites you want.

The cipher suites can be manipulated by command as well. Please refer to TLS Module | Microsoft Learn for details.

TLS certificate is not trusted

Detection

Access the URL from a web browser. It does not matter whether the page loads or not; before loading anything from the remote server, the browser tries to establish the TLS connection. If you see the error below, the certificate is not trusted on the current machine.

Solution

To resolve this issue, we need to add the CA certificate into the client's trusted root store. The CA certificate can be obtained from the web browser:

Click the warning icon -> the 'isn't secure' warning in the browser.
Click the 'show certificate' button.
Export the certificate.
Import the exported .crt file into the client system.

Windows

Open Manage computer certificates.
Trusted Root Certification Authorities -> Certificates -> All Tasks -> Import.
Select the exported .crt file and keep the other default settings.

Ubuntu

The command below lists the CAs currently trusted by the system:

awk -v cmd='openssl x509 -noout -subject' ' /BEGIN/{close(cmd)};{print | cmd}' < /etc/ssl/certs/ca-certificates.crt

If you do not see the desired CA in the result, the commands below add new CA certificates:

$ sudo cp <exported crt file> /usr/local/share/ca-certificates
$ sudo update-ca-certificates

RedHat/CentOS

The command below lists the CAs currently trusted by the system:

awk -v cmd='openssl x509 -noout -subject' ' /BEGIN/{close(cmd)};{print | cmd}' < /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

If you do not see the desired CA in the result, the commands below add new CA certificates:
sudo cp <exported crt file> /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust

Java

The JVM uses a trust store that contains certificates of well-known certification authorities. The trust store on the machine may not contain the new certificates that we recently started using. If that is the case, the Java application receives SSL failures when trying to access the storage endpoint. The errors look like the following:

Exception in thread "main" java.lang.RuntimeException: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at org.example.App.main(App.java:54)
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:371)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:314)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:309)

Run the command below to import the .crt file into the JVM certificate store. The command works with JDK 19.0.2.

keytool -importcert -alias <alias> -keystore "<JAVA_HOME>/lib/security/cacerts" -storepass changeit -file <crt_file>

The command below exports the current certificate information from the JVM certificate store:

keytool -keystore "<JAVA_HOME>\lib\security\cacerts" -list -storepass changeit > cert.txt

The certificate will be displayed in the cert.txt file if it was imported successfully.
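For the Windows import steps described earlier, the same result can be scripted instead of using the certificate MMC. Below is a minimal PowerShell sketch, not part of the original article; it assumes the built-in PKI module on Windows, an elevated session, and a hypothetical file path for the exported certificate.

# Import the exported CA certificate into the local machine's Trusted Root store.
# The file path is a placeholder - replace it with your exported .crt file.
Import-Certificate -FilePath "C:\temp\exported-ca.crt" -CertStoreLocation Cert:\LocalMachine\Root

# Optionally verify that the CA now appears in the Trusted Root store.
# Replace the subject filter with (part of) your CA's name.
Get-ChildItem Cert:\LocalMachine\Root | Where-Object { $_.Subject -like "*YourCA*" }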
SSL/TLS connection issue troubleshooting guide

You may experience exceptions or errors when establishing TLS connections with Azure services. The exceptions vary dramatically depending on the client and server types; typical ones include "Could not create SSL/TLS secure channel." and "SSL Handshake Failed". In this article we will discuss the common causes of TLS-related issues and the troubleshooting steps.
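As a starting point for troubleshooting, the handshake itself can be exercised from PowerShell using the .NET SslStream class. The sketch below is an illustrative addition rather than part of the original guide; the host name and port are placeholders, and it assumes outbound connectivity from the machine you are testing on.

# Attempt a TLS handshake against a remote endpoint and report the negotiated protocol.
$targetHost = "www.contoso.com"   # placeholder - replace with the server you are troubleshooting
$port       = 443

$client = New-Object System.Net.Sockets.TcpClient($targetHost, $port)
try {
    # Accept any certificate so that trust errors do not hide protocol/cipher problems.
    $ssl = New-Object System.Net.Security.SslStream($client.GetStream(), $false, { $true })
    $ssl.AuthenticateAsClient($targetHost)
    "Negotiated protocol : $($ssl.SslProtocol)"
    "Cipher algorithm    : $($ssl.CipherAlgorithm)"
    "Remote certificate  : $($ssl.RemoteCertificate.Subject)"
}
catch {
    "Handshake failed: $($_.Exception.Message)"
}
finally {
    $client.Close()
}

If the handshake fails here as well, the causes discussed above (TLS version mismatch, cipher suite mismatch, untrusted certificate) are the first things to check.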
Azure Service Fabric | EnableAutomaticUpdates, EnableAutomaticOSUpgrade, FabricUpgradeMode

This article explains the above three properties in detail. Although they sound similar, they are quite different from each other. Let's talk about them.

EnableAutomaticUpdates

This property enables automatic updates for the Windows virtual machines. Its default value is true. When we set enableAutomaticUpdates to true, we are enabling Windows updates, i.e. patch upgrades, etc. Note that these are not OS version upgrades, i.e. not from Windows Server 2012 to Windows Server 2016 or similar. Once configured, the latest OS image published by the image publisher is automatically applied to the scale set without user intervention. However, for the change to be reflected in the Azure portal under the VMSS section, we need to reimage the nodes; only then will the corresponding value be displayed under "Windows automatic updates." Please refer to the screenshot below, where we have configured enableAutomaticUpdates to true:

EnableAutomaticOSUpgrade

On the other hand, enableAutomaticOSUpgrade indicates whether OS upgrades should automatically be applied to scale set instances in a rolling fashion when a newer version of the OS image becomes available. The default value is false. If this is set to true for Windows-based scale sets, the recommendation is to set enableAutomaticUpdates to false. Because the upgrades are applied to the VMs in a rolling fashion, there should be no downtime for the deployed applications; however, the specific node undergoing the upgrade will be down for some time until the upgrade finishes. There are some prerequisites for this property, which are described in https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade.

The above two properties can be seen in resources.azure.com as below:

FabricUpgradeMode

FabricUpgradeMode applies to the Service Fabric code and configuration. Microsoft announces Service Fabric upgrades regularly, and they can be found in the Service Fabric team blog. Usually, support for a specific version ends a minimum of 60 days after the announcement. We can either use the Automatic upgrade mode or change to Manual mode. Keep in mind that Microsoft introduces many security fixes, bug fixes, etc. with the release of new versions, so it is always recommended to use the latest Service Fabric version. For a production cluster, we recommend using Manual mode and explicitly upgrading the SF version whenever a new release is introduced, preferably during off-production hours.

Some use cases

If enableAutomaticUpdates is set to true, the upgrade mode is set to Automatic, and enableAutomaticOSUpgrade is set to false, we are allowing OS updates to be installed automatically on all VMs in one go, and there is a high possibility that a number of nodes in the cluster will go down at the same time. However, if we change enableAutomaticOSUpgrade to true, the updates are applied in a rolling fashion, i.e. one node at a time, and this should not cause any downtime for the deployed applications. If we want to disable Windows updates, we need to set enableAutomaticUpdates to false; as mentioned in https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.compute.models.windowsconfiguration.enableautomaticupdates?view=azure-dotnet, the change takes effect only on OS re-provisioning, i.e. by reimaging the nodes one by one.
Reimaging will put the specific node down for a short time, but as your applications are deployed across different nodes, there will be no downtime. Note that this is not a recommended option, because the updates include security vulnerability fixes, bug fixes, patches, etc.
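The two VMSS properties discussed above can also be inspected and changed with Az PowerShell instead of resources.azure.com. The following is a minimal sketch, not the article's procedure; it assumes the Az.Compute module, an authenticated session, placeholder resource names, and that the property paths on the returned VMSS object match your module version, so verify them before use.

# Placeholders - replace with your own resource group and node type scale set names.
$rg   = "my-sf-rg"
$vmss = Get-AzVmss -ResourceGroupName $rg -VMScaleSetName "nodetype1"

# Turn off in-guest Windows updates (enableAutomaticUpdates) ...
$vmss.VirtualMachineProfile.OsProfile.WindowsConfiguration.EnableAutomaticUpdates = $false

# ... and turn on rolling OS image upgrades (enableAutomaticOSUpgrade).
# If the policy object is not present on your object model, create it first.
if (-not $vmss.UpgradePolicy.AutomaticOSUpgradePolicy) {
    $vmss.UpgradePolicy.AutomaticOSUpgradePolicy =
        New-Object Microsoft.Azure.Management.Compute.Models.AutomaticOSUpgradePolicy
}
$vmss.UpgradePolicy.AutomaticOSUpgradePolicy.EnableAutomaticOSUpgrade = $true

# Push the modified model back to the scale set.
Update-AzVmss -ResourceGroupName $rg -VMScaleSetName $vmss.Name -VirtualMachineScaleSet $vmss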
Not enough disk space issue in Service Fabric cluster

From time to time, a Service Fabric cluster may run into different issues where the reported error/warning message indicates that one or more nodes do not have enough disk space. This can be caused by different reasons. This blog talks about the common solutions for this issue.

Possible root causes:

There are many possible root causes for the not-enough-disk-space issue. In this blog we will mainly talk about the following five:

1. Diagnostic log files (.trace and .etl) consume too much space
2. The paging file consumes too much space
3. Too many application packages exist on the node
4. Too many registered versions of an application type
5. Too many images exist (only for clusters using containers)

To identify which one matches your scenario, please check the following descriptions:

For the log files, we can RDP into the node reporting not enough disk space and check the size of the folder D:\SvcFab\Log. If the size of this folder is bigger than expected, we can reconfigure the cluster to decrease the size limit of the diagnostic log files.

For the paging file, it is a built-in feature of Windows; for a detailed introduction, please check this document. To verify whether we have this issue, we can RDP into the node and check whether the hidden file D:\pagefile.sys exists. If it does, the Service Fabric cluster is consuming some disk space as RAM, and we can consider configuring the paging file to be stored on drive C instead of drive D.

For too many application packages consuming disk space on the node, we can verify it in Service Fabric Explorer (SFX). Open SFX from the Azure portal Service Fabric overview page, go to the Image Store page of the cluster, and verify whether there is any record with a name other than Store and WindowsFabricStore. If yes, click the Load Size button to check its size.

Similar to point 3, for too many registered versions of an application type, we check the same page but pay attention to the size of Store and see whether it consumes a lot of disk space. When a version of an application type is registered in the SF cluster, SF saves the files used to deploy the services included in that version onto each node. The more versions are registered, the more disk space is consumed.

The too-many-images cause only happens when the Service Fabric cluster uses the container feature. We can RDP into the node and run docker image ls to list all images on the node. If there are images that were used before but were not removed/pruned after they were no longer needed, they consume a lot of disk space, since image files are normally huge; for example, the Windows Server Core image is more than 10 GB.

Possible solutions:

Now let's talk about the solutions for the issues above.

1. To reconfigure the size limit of the diagnostic log files, we need to open a PowerShell window with the Az module installed. Please refer to the official document for how to install it. After logging in, we can use the following command to set the expected size limit:

Set-AzServiceFabricSetting -ResourceGroupName SF-normal -Name sfhttpjerry -Section Diagnostics -Parameter MaxDiskQuotaInMB -Value 25600

Please remember to replace the resource group name, Service Fabric cluster name, and size limit before running the command.
Once this command runs successfully, it may not take effect immediately. The Service Fabric cluster scans the size of the diagnostic log periodically, so we need to wait until the next scan is triggered. Once it is triggered, if the size of the diagnostic log files is bigger than the configured value (25600 MB = 25 GB in my example), the cluster automatically deletes some log files to release disk space.

2. To change the path of the paging file, we can follow these steps:

Check the status of the Service Fabric cluster in Service Fabric Explorer to make sure every node, service, and application is healthy.
RDP into the VMSS node.
In the search bar, type "Advanced System Settings", then choose Advanced -> Advanced -> Change. Set drive D to "No paging file" and set drive C to "System managed size".
This change requires a reboot of the VMSS node to take effect. Please reboot the node and wait until everything is back to a healthy status in Service Fabric Explorer before RDPing into the next node.
Repeat the above steps for all nodes.

3. Cleaning up application packages is easy to do in SFX. Go to the same Image Store page used to diagnose this root cause. On the left side there is a menu to delete the unneeded package. After typing the name in the confirmation window and selecting Delete Image Store content, the cluster automatically deletes the unneeded application package on every node.

4. For the issue caused by too many registered versions of an application type, we need to manually unregister the versions that are no longer needed. In Service Fabric Explorer, click on the application/application type to see the currently registered versions. If any version is not currently used and no longer needed, unregister it with the following command:

Unregister-ServiceFabricApplicationType -ApplicationTypeName "application type name" -ApplicationTypeVersion "version number" -Force

5. For the issue caused by too many images, we can configure the cluster to automatically delete unused images. The detailed configuration can be found in this document, and the way to update the cluster configuration is as follows:

a. Visit Azure Resource Explorer in Read/Write mode, log in, and find the Service Fabric cluster.
b. Click the Edit button and modify the JSON cluster configuration as needed; in this solution, add the configuration into the fabricSettings part.
c. Send the request to save the new configuration by clicking the green PUT button and wait until the provisioning status of the cluster becomes Succeeded.

To make this solution work, one more thing we need to do is unregister all unnecessary and unused applications. This can also be done with the command documented here. Since the parameters ApplicationTypeName and ApplicationTypeVersion are both required, we can only unregister one version of an application type per invocation.
Since you may have many versions and many application types, here are two possible approaches:

If there are versions of some application types that you want to keep registered for future use in this cluster, unregister only the unnecessary versions by running Unregister-ServiceFabricApplicationType -ApplicationTypeName VotingType -ApplicationTypeVersion 1.0.1 (remember to replace the ApplicationTypeName and ApplicationTypeVersion, and use step 2.e to connect to the cluster first).

If there is no version of any application type that you want to keep specially, which means we only need to keep the application types being used by running applications, we can use step 2.e to connect to the cluster and then run the following script:

$apptypes = Get-ServiceFabricApplicationType
$apps = Get-ServiceFabricApplication
$using = $false
foreach ($apptype in $apptypes)
{
    $using = $false
    foreach ($app in $apps)
    {
        if ($apptype.ApplicationTypeName -eq $app.ApplicationTypeName -and $apptype.ApplicationTypeVersion -eq $app.ApplicationTypeVersion)
        {
            $using = $true
            break
        }
    }
    if ($using -eq $false)
    {
        Unregister-ServiceFabricApplicationType -ApplicationTypeName $apptype.ApplicationTypeName -ApplicationTypeVersion $apptype.ApplicationTypeVersion -Force
    }
}

In addition to the possible causes and solutions above, there are three more possible solutions for the "Not enough disk space" issue. They are explained below.

Scale out the VMSS: Sometimes scaling out, i.e. increasing the number of nodes of the Service Fabric cluster, also helps mitigate the disk-full issue. This operation is not only useful for improving CPU and memory usage, but also auto-balances the distribution of services among nodes, which improves disk usage. When using Silver or higher durability, we can scale out the node count in the VMSS directly.

Scale up the VMSS: Since the issue is a full disk, we can simply change the VM SKU to a bigger size to get more disk space. But please check all the solutions above first to make sure everything is reasonable and we really do need more disk space to handle more data. For example, if our application contains stateful services and the disk fills up because the stateful services store too much data, we should consider improving the code logic rather than scaling up the VMSS first; otherwise, with a bigger VM SKU, the issue will reproduce sooner or later. To scale up the VMSS, we can use one of the following two ways:

We can use the command Update-AzVmss to update the state of the VMSS. This is the simple way, but it is not recommended because there is a small risk of data loss or instances going down. When using Silver or higher durability, the risk is mitigated because repair tasks are supported.
The second way to upgrade the size of the SF primary node type is to add a new node type with the bigger SKU. This option is much more involved than option one but is officially recommended; you can check the document for more information.

Reconfigure the ReplicatorLog size: Please be aware that the ReplicatorLog does not store any kind of log file; it stores important data for both the Service Fabric cluster and the applications. Deleting this folder may cause data loss. The size of this folder is fixed to the configured size, 8 GB by default; it always occupies the same amount of space regardless of how much data is stored.
It is NOT recommended to modify this setting; you should only do it if you absolutely have to, as it carries a risk of data loss.

To resize the ReplicatorLog, as mentioned above, the key point is to add a customized KtlLogger setting to the Service Fabric cluster. To do that:

a. Visit Azure Resource Explorer in Read/Write mode, log in, and find the Service Fabric cluster.
b. Add the KtlLogger setting into the fabricSettings part. The expected expression is like the following:

{
  "name": "KtlLogger",
  "parameters": [{
    "name": "SharedLogSizeInMB",
    "value": "4096"
  }]
}

c. Send the request to save the new configuration by clicking the green PUT button and wait until the provisioning status of the cluster becomes Succeeded.
d. Visit SFX and check that everything is in a healthy state.
e. Open a PowerShell window on a computer where the cluster certificate is installed. If the Service Fabric module is not installed yet, please refer to our document to install it first. Then run the following commands to connect to the Service Fabric cluster. The thumbprint is that of the cluster certificate; also remember to replace the cluster name with the correct URL.

$ClusterName= "xxx.australiaeast.cloudapp.azure.com:19000"
$CertThumbprint= "7279972D160AB4C3CBxxxxx34EA2BCFDFAC2B42"
Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 -X509Credential -ServerCertThumbprint $CertThumbprint -FindType FindByThumbprint -FindValue $CertThumbprint -StoreLocation CurrentUser -StoreName My

f. Use the command below to disable one node in the Service Fabric cluster (_nodetype1_0 in the example):

Disable-ServiceFabricNode -NodeName "_nodetype1_0" -Intent RemoveData -Force

g. Monitor in SFX until the node from the last command has the status Disabled.
h. RDP into this node and manually delete the D:\SvcFab\ReplicatorLog folder. Attention! This operation removes everything in ReplicatorLog. Please double-check whether any content there is still needed before deleting.
i. Use the following command to enable the disabled node, and monitor until the node status is Up:

Enable-ServiceFabricNode -NodeName "_nodetype1_0"

j. Wait until everything is healthy in SFX and repeat steps f to i on every node. After that, the ReplicatorLog folder on each node will have the new customized size.
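Steps f, g, and i above can be wrapped in a small helper so the wait-for-status part is not done by eye. The following is a minimal sketch, not from the original article; it assumes you have already connected to the cluster as in step e, and the node name is a placeholder. The ReplicatorLog folder still has to be deleted manually over RDP while the node is disabled.

# Placeholder node name - replace with the node you are working on.
$nodeName = "_nodetype1_0"

# Step f: disable the node with intent RemoveData.
Disable-ServiceFabricNode -NodeName $nodeName -Intent RemoveData -Force

# Step g: wait until the node reports Disabled.
while ((Get-ServiceFabricNode -NodeName $nodeName).NodeStatus -ne "Disabled") {
    Start-Sleep -Seconds 30
}
Write-Host "$nodeName is disabled - RDP in and delete D:\SvcFab\ReplicatorLog now."
Read-Host "Press Enter once the folder has been deleted"

# Step i: enable the node again and wait until it is Up.
Enable-ServiceFabricNode -NodeName $nodeName
while ((Get-ServiceFabricNode -NodeName $nodeName).NodeStatus -ne "Up") {
    Start-Sleep -Seconds 30
}
Write-Host "$nodeName is back up."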
Force delete application while the Application/Service stuck in deleting state

A few days back I was working on a scenario where I was unable to delete an application and its service from Service Fabric Explorer and PowerShell. Looking at the application state in Service Fabric Explorer, the application was stuck in the deleting state. Executing the PowerShell command returned the following error:

Remove-ServiceFabricApplication -ApplicationName fabric:/Voting -Force -TimeoutSec 350

Remove-ServiceFabricApplication : Operation timed out.
At line:1 char:1
+ Remove-ServiceFabricApplication -ApplicationName fabric:/Voting -Forc ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationTimeout: (Microsoft.Servi...usterConnection:ClusterConnection) [Remove-ServiceFabricApplication], TimeoutException
+ FullyQualifiedErrorId : RemoveApplicationInstanceErrorId,Microsoft.ServiceFabric.Powershell.RemoveApplication

In this scenario, there are PowerShell cmdlet switches that play a vital role and can help mitigate the issue. I tried the -ForceRemove switch. The following script helped me delete the application whose replicas were distributed across multiple nodes:

$ClusterName= "clustername.cluster_region.cloudapp.azure.com:19000"
$Certthumprint = "{replace_with_ClusterThumprint}"

Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
    -X509Credential `
    -ServerCertThumbprint $Certthumprint `
    -FindType FindByThumbprint `
    -FindValue $Certthumprint `
    -StoreLocation CurrentUser `
    -StoreName My

$ApplicationName = "fabric:/voting"
foreach($node in Get-ServiceFabricNode)
{
    [void](Get-ServiceFabricDeployedReplica -NodeName $node.NodeName -ApplicationName $ApplicationName | Remove-ServiceFabricReplica -NodeName $node.NodeName -ForceRemove)
}
Remove-ServiceFabricApplication -ApplicationName $ApplicationName -Force

There might be situations where the -Force switch also fails. In that scenario, try the following:

1) Move the Cluster Manager primary to another node and try to remove the app again:
a) Move-ServiceFabricPrimaryReplica -PartitionId [Cluster Manager System Service Partition Id] -ServiceName fabric:/System/ClusterManagerService
b) Remove-ServiceFabricApplication -ApplicationName fabric:/voting -Force -ForceRemove

2) If it still doesn't work due to stuck services or PLB affinity, try to move the primary for each partition of NamingService and try again.

There might also be a scenario where the issue happens with only one service distributed across multiple nodes instead of the whole application. In that case use the script below, which is similar to the one for the application but with a small change so that it only deletes a specific service:

$ClusterName= "clustername.cluster_region.cloudapp.azure.com:19000"
$Certthumprint = "{replace_with_ClusterThumprint}"

Connect-ServiceFabricCluster -ConnectionEndpoint $ClusterName -KeepAliveIntervalInSec 10 `
    -X509Credential `
    -ServerCertThumbprint $Certthumprint `
    -FindType FindByThumbprint `
    -FindValue $Certthumprint `
    -StoreLocation CurrentUser `
    -StoreName My

$ApplicationName = "fabric:/voting"
$ServiceName = "fabric:/Voting/FEService"
foreach($node in Get-ServiceFabricNode)
{
    [void](Get-ServiceFabricDeployedReplica -NodeName $node.NodeName -ApplicationName $ApplicationName | Where-Object {$_.ServiceName -match $ServiceName} | Remove-ServiceFabricReplica -NodeName $node.NodeName -ForceRemove)
}
Remove-ServiceFabricService -ServiceName $ServiceName -Force

Hope this helps.
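One addendum: step 1a above leaves the Cluster Manager partition ID as a placeholder. A quick way to look it up is sketched below; this is an illustrative addition rather than part of the original post, and it assumes you are already connected to the cluster.

# Look up the partition ID of the Cluster Manager system service.
$partition = Get-ServiceFabricPartition -ServiceName fabric:/System/ClusterManagerService
$partition.PartitionInformation.Id

# Move its primary replica; without -NodeName, Service Fabric chooses the target node.
Move-ServiceFabricPrimaryReplica -PartitionId $partition.PartitionInformation.Id -ServiceName fabric:/System/ClusterManagerService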
How to enable IPv4+IPv6 dual-stack feature on Service Fabric cluster

As IPv4 addresses are already exhausted, more and more service providers and website hosts are using IPv6 addresses on their servers. Although IPv6 is the successor of IPv4, their packet headers and address formats are completely different; for this reason, users can consider IPv6 a different protocol from IPv4. In order to be able to communicate with a server that uses the IPv6 protocol only, we need to enable IPv6 on the Service Fabric cluster and its related resources. This blog mainly talks about how to enable this feature on a Service Fabric cluster by ARM template.

Prerequisite:

You should be familiar with how to deploy a Service Fabric cluster and related resources by ARM template. This is not only about downloading the ARM template from the official document; it also includes preparing a certificate, creating a Key Vault resource, and uploading the certificate into the Key Vault. For detailed instructions, please check this official document.

Abbreviations:

VMSS - Virtual Machine Scale Set
SF - Service Fabric
NIC - Network Interface Configuration
OS - Operating System
VNet - Virtual Network

Limitation:

Currently, none of the Windows OS images with container support, i.e. those whose names end with -Containers such as WindowsServer 2016-Datacenter-with-Containers, support enabling the IPv4+IPv6 dual-stack feature.

The design changes before and after enabling this feature:

Before talking about the ARM template changes, it is better to have a full picture of the changes on each resource type. Before enabling the IPv6 dual-stack feature, the traffic flow of the Service Fabric cluster is as follows:

The client sends a request to the public IP address, which is in IPv4 format. The protocol used is IPv4.
The Load Balancer listens on its frontend public IP address and decides which VMSS instance to route the traffic to according to the 5-tuple rule and the load balancing rules.
The Load Balancer forwards the request to the NIC of the VMSS.
The NIC of the VMSS forwards the request to the specific VMSS node over IPv4. The internal IP addresses of the VMSS nodes are also in IPv4 format.

After enabling the IPv6 dual-stack feature, the traffic flow of the Service Fabric cluster is as follows (the differences are highlighted):

The client sends a request to one of the public IP addresses associated with the Load Balancer. One of them is an IPv4 address and the other is an IPv6 address. The protocol used can be IPv4 or IPv6, depending on which public IP address the client targets.
The Load Balancer listens on its frontend public IP addresses and decides which VMSS instance to route the traffic to according to the 5-tuple rule. Then, according to the load balancing rules and the incoming request protocol, the Load Balancer decides which protocol to use to forward the request.
The Load Balancer forwards the request to the NIC of the VMSS.
The NIC of the VMSS forwards the request to the specific VMSS node. The protocol is the one decided by the Load Balancer in the previous step. The VMSS nodes have both IPv4 and IPv6 internal IP addresses.

By comparing the traffic before and after enabling the dual-stack feature, it is not difficult to find the differences, which are also the configurations to be changed:

Users need to create a second public IP address with an IPv6 address and set the first public IP address type to IPv4.
Users need to add an IPv6 address range to the VNet and subnet used by the VMSS.
Users need to add additional load balancing rules in the Load Balancer for IPv6 traffic.
Users need to modify the NIC of the VMSS to accept both the IPv4 and IPv6 protocols.

In addition to the above four points, users also need to remove the nicPrefixOverride setting from the VMSS Service Fabric node extension, because this override setting does not support IPv4+IPv6 dual stack yet. There is no risk in removing this setting, because it only takes effect when the SF cluster works with containers, which is not possible in this scenario due to the OS image limitation.

Changes to the ARM template:

After understanding the design changes, the next part is the changes in the ARM template. If you are going to deploy a new SF cluster with this feature, the ARM template and parameter files are provided here. After downloading these files:

Modify the parameter values in the parameter file.
Decide whether to change the values of some variables, for example the IP ranges of the VNet and subnet, the DNS name, the Load Balancer resource and load balancing rule names, etc. These values can be customized according to your own requirements; if this is for test purposes only, this step can be skipped.
Use your preferred way, such as the Azure portal, Azure PowerShell, or Azure CLI, to deploy the template.

If you are going to upgrade an existing SF cluster and its resources to enable this feature, here are the points you will need to modify in your own ARM template. The template in this blog is modified based on the official example ARM template. To understand the changes more easily, the templates before and after the change are both provided here; please follow the explanation below and compare the two templates to see what change is needed.

In the variables part, multiple variables should be added. These variables will be used in the next parts. Tip: as documented here, the IP range of the subnet with IPv6 (subnet0PrefixIPv6) must end with "/64".

dnsIPv6Name - DNS name to be used on the public IP address resource with an IPv6 address
addressPrefixIPv6 - IPv6 address range of the VNet
lbIPv6Name - IPv6 public IP address resource name
subnet0PrefixIPv6 - IPv6 address range of the subnet
lbIPv6IPConfig0 - IPv6 IP config name in the load balancer
lbIPv6PoolID0 - IPv6 backend pool name in the load balancer

For the VNet: add the IPv6 address range to the VNet address space and the subnet address range.

For the public IP address:
Set the publicIPAddressVersion property of the existing public IP address to IPv4.
Create a new public IP address resource with an IPv6 address.

For the Load Balancer (referenced document):
a. Add an IPv6 frontend IP configuration.
b. Add an IPv6 backend address pool.
c. Duplicate every existing load balancing rule into an IPv6 version (only one is shown as an example here). Note: as documented here, the idle timeout setting of an IPv6 load balancing rule cannot be modified yet; the default timeout is 4 minutes.
d. Modify the depending resources.

For the VMSS (referenced document):
a. Remove the nicPrefixOverride setting from the SF node extension.
b. Set the primary and privateIPAddressVersion properties of the existing IP configuration and add the IPv6-related configuration to the NIC part.

For SF: modify the depending resources. Tip: this upgrade operation sometimes takes a long time; my test took about 3 hours to finish. Traffic to the IPv4 endpoint will not be blocked during this process.
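The article drives all of these changes through the ARM template. Purely as an illustration of what the public IP and Load Balancer frontend changes amount to, here is a small Az PowerShell sketch; the resource names are placeholders, and this is not a substitute for the template approach above, which also covers the VNet, NIC, and VMSS changes.

# Placeholders - adjust to your environment.
$rg       = "my-sf-rg"
$location = "eastus"

# Create the additional IPv6 public IP address (dynamic allocation, Basic SKU by default).
$ipv6Pip = New-AzPublicIpAddress -ResourceGroupName $rg -Name "LBIP-IPv6" -Location $location `
    -AllocationMethod Dynamic -IpAddressVersion IPv6 -DomainNameLabel "sfcluster-ipv6"

# Add an IPv6 frontend configuration to the existing Service Fabric load balancer.
$lb = Get-AzLoadBalancer -ResourceGroupName $rg -Name "LB-sfcluster-nodetype1"
$lb | Add-AzLoadBalancerFrontendIpConfig -Name "LoadBalancerIPv6Config" -PublicIpAddress $ipv6Pip | Out-Null
$lb | Set-AzLoadBalancer

The IPv6 backend pool, the duplicated load balancing rules, and the dual-stack NIC configuration still need to be added as described in the template changes above.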
The result of the upgrade (before/after screenshots):

How to test IPv6 communication to the SF cluster:

To verify whether communication to the SF cluster over IPv6 is enabled, we can do the following.

For a Windows VMSS: simply open the Service Fabric Explorer website with your IPv6 IP address or domain name. The domain names of the IPv4 and IPv6 public IP addresses can be found on the public IP address overview page (the IPv6 public IP address is used as the example here). The result with the IPv4 domain URL:

https://sfjerryipv6.eastus.cloudapp.azure.com:19080/Explorer

For IPv6, we only need to replace the part between https:// and :19080/Explorer with the domain name of your IPv6 public IP address, such as:

https://sfjerryipv6-ipv6.eastus.cloudapp.azure.com:19080/Explorer

If both tests return the SF Explorer page correctly, the IPv4+IPv6 dual-stack feature of the SF cluster is verified as working.

For a Linux VMSS: due to the design, Service Fabric Explorer currently does not work over IPv6 for a Linux SF cluster. To verify IPv6 traffic to a Linux SF cluster, please deploy a web application that listens on port 80 and visit the website using the IPv6 domain name with port 80. If everything works well, it returns the same page as visiting the IPv4 domain name on port 80.

(Optional) How to test IPv6 communication from the SF cluster backend VMs:

As explained in the first part, communication from the SF cluster to other remote servers over IPv6 will also be an increasing requirement. After enabling the IPv4+IPv6 dual-stack feature, not only can users reach the SF cluster over IPv6, but communication from the SF cluster to other servers over IPv6 is also enabled. To verify this:

For a Windows VMSS: RDP into any node, install and open the Edge browser, and visit https://ipv6.google.com. If it returns the normal Google homepage, it is working well.

For a Linux VMSS: SSH into any node and run the command: curl -6 ipv6.google.com. If the result starts with <!doctype html> and you can find content like <meta content="Search the world's information, ...>, it is working well.

Summary

By following this step-by-step guideline, enabling the IPv4+IPv6 dual-stack feature should no longer block the usage of an SF cluster. As the SF cluster itself has a complicated design and this process involves changes to multiple resources, if there is any difficulty, please do not hesitate to reach out to Azure customer support for help.
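As a final note on the Windows-side check above, the same verification can be done from any IPv6-capable Windows client without a browser. This is a minimal sketch, assuming the -ipv6 DNS label from the example in this article (replace it with your own IPv6 public IP's domain name); since that label only resolves to an AAAA record, a successful TCP test confirms the IPv6 path.

# Test the Service Fabric Explorer port over the IPv6-only DNS name.
# The host name below is the example label from this article - replace it with yours.
Test-NetConnection -ComputerName "sfjerryipv6-ipv6.eastus.cloudapp.azure.com" -Port 19080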