Core Infrastructure and Security Blog

9 MIN READ

MSDTC Best Practices with an Availability Group

Microsoft

Oct 14, 2019

In this article we are going to look at some guidelines and MSDTC best practices with an Availability Group. Please note that some of these recommendations change if you are deploying a Failover Cluster Instance. If you are using the MSDTC (BizTalk, Linked Servers, etc.) and hosting the databases in an Availability Group, there are several things we need to take into consideration.

Precedence

When an instance of SQL Server starts it enlists with a transaction coordinator and Resource Manager. Think of the coordinator as the Windows Service and the Manager as the handler between the service and SQL Server. The SQL Server instance has always been the Resource Manager until SQL 2016 when support for the MSDTC in an AG became available. In a SQL 2016 AG where per database MSDTC has been enabled, each database becomes its own Resource Manager. For more information on versions and supportability for MSDTC please refer to the “Availability Group” and “Support Matrix” sections of this article on MSDTC Supported Configurations.

The order of precedence defines which MSDTC Coordinator SQL Server will enlist with at startup. Although the precedence order has been discussed in the past, what we will see is that there really is no order. The first place SQL Server looks is to see what the Default Clustered Instance is configured for within the Windows Cluster. We can find this by opening Component Services or comexp.msc, right clicking My Computer, and selecting Properties. Here is what that looks like prior to Windows Server 2016. In Windows 2016+ you would not see the check box for “Use Local Coordinator”.

We can see that the Default Coordinator will use the local MSDTC by default (A default with a default!). In Windows 2016+ you cannot change this default and would need to disable the service. I recommend disabling the service no matter what version of Windows you are using. Note that once you disable the service for the local MSDTC, you need have a clustered MSDTC defined or you will not have anything at all. Refer to the Manually Map MSDTC section later in this article for more information. You might be tempted to leave it on thinking it could serve as a backup, but if you do not want to use the local MSDTC you MUST disable this. Also, the SQL instance only enlists with an MSDTC Coordinator once at startup and never tries again. If you enable WITH DTC_SUPPORT = PER_DB in an AG, then this occurs for each database when it starts. If we want to use a clustered MSDTC then we will need to remove that check mark and select our desired default. Here is where you can find and disable the service using services.msc.

Looking back at Component Services you will see we have a drop-down box where we can select our Default Clustered Instance. Whatever you select here is what SQL will use for its MSDTC Coordinator and completes our “order of precedence”. You may have heard about an order that describes a DTC in its own role or DTCs within SQL roles. I will describe those scenarios in the Quantity section but know that they must be manually mapped per instance to override using the default, so this is where our order ends. There are no automatic choices for SQL beyond the Default Clustered Instance since it must be defined.

Availability

A paramount concept to understand is how to make the DTC highly available. We can see from the precedence order that SQL Server will use the local DTC out of the box. This makes it appear that everything is working, and it is, but it is not exactly highly available.

I see a lot of customers leave it configured this way because they either don’t know the ramifications or do not realize they are using the MSDTC (Linked Servers). Since it simply works out of the box, things get left this way until they end up with a suspect database and error messages that look like this:

"SQL Server detected a DTC/KTM in-doubt transaction with UOW {598B7EDD-F7A1-9DC1-8D3E-303A4C93AAB4}.Please resolve it following the guideline for Troubleshooting DTC Transactions."

"An error occurred while recovering database 'MyCriticalDB'. Unable to connect to Microsoft Distributed Transaction Coordinator (MS DTC) to check the completion status of transaction (5:861236355). Fix MS DTC, and run recovery again."

"Database 'MyCriticalDB' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information."

Let’s look at an example of how this could happen. Assume we have a 2 node AG using local DTCs. Here is what that would look like.

If Node1 fails and the local MSDTC on that node has an in-flight distributed transaction, the local MSDTC on Node2 kicks in after the failover. The local MSDTC on Node2 tries to contact the local MSDTC on Node1 and retrieve the status of any in-doubt transactions. This process is attempted 3 times. If the old primary (Node1) still cannot be contacted, SQL takes its direction from the In-doubt Transaction Configuration or "in-doubt xact resolution" in sp_configure. SPOILER ALERT – This is not configured by default which means your database will go into Suspect mode…Yikes!

Is there some availability in this configuration? Yes, since the new primary node attempts to discover the transaction status from the failed primary node. Is it highly available? No, and the prior error messages demonstrate that. If the old primary does not come back online and fast enough prior to the 3 attempts from the new primary, SQL must decide whether to commit or roll back. The problem is that the status of in-flight distributed transactions is stored locally on the node. If the log was on shared storage then the Resource Manager would know the outcome.

If using the local DTC is not highly available, the next natural question is, “How do we remedy that?” The answer lies in the log for the MSDTC service. Just like SQL Server, the MSDTC has a log file it uses to track the status of distributed transactions. If we can make that log file highly available, then we make the MSDTC highly available.

There are several methods to make the MSDTC log highly available by making the storage where it is located highly available. This ensures that the MSDTC log and thus the outcome of distributed in-flight transactions is immediately available post failover. You can accomplish this using technologies like iSCSI (assuming the host has redundancy for the storage) or Storage Spaces Direct (S2D) and we have some partners with excellent solutions as well. If you want to see how to do this using S2D you can check out this post on Configure SQL Server Failover Cluster Instance on Azure Virtual Machines with MSDTC.

That post is geared toward FCIs, but the MSDTC part is identical for AGs. The one thing you will note about having shared storage to make the MSDTC highly available is that it is not congruent with the architecture of AGs. AGs do not have shared storage by design, so it seems counter intuitive to add shared storage to the design, but it is required only for the MSDTC to make it highly available like the AG.

Quantity

The last thing we want to consider is scaling out. This is completely dependent on your MSDTC workload and may not be necessary. If you have multiple instances of SQL Server running in your cluster and you create one clustered MSDTC in its own resource group, every instance uses that same MSDTC. If you have a lot of distributed transactions this can become a bottleneck.

For Failover Cluster Instances we can create an MSDTC per instance because each one is its own instance of SQL Server. However, if you have multiple AGs they are most likely in the same instance of SQL Server and every AG will use the same MSDTC. In this situation, you would need to spread the AGs across multiple instances.

The other issue that can get introduced with a single clustered MSDTC in its own role, is that you cannot guarantee it will run on the same node as your SQL instance. You could easily have an instance running on Node2 and using an MSDTC running on Node1. This can introduce network latency. The best practice here would be to put the MSDTC for the instance running your AG in the same Windows Cluster Role.

If you decide to have multiple MSDTCs for multiple instances and put them in their respective roles, you will need to manually map them or they will continue to use the Default Clustered Instance.

Manually Map MSDTC

To manually map MSDTC to an instance of SQL Server we will need to use the msdtc.exe command line tool. To start off we simply want to check if any mappings already exist using this command.

msdtc -tmMappingView *

To add a mapping, we use the -tmMappingSet parameter along with -name, -service, and -ClusterResourceName. An example would look like this.

msdtc -tmMappingSet -name MyMSDTC -service MSSQLServer -ClusterResourceName ClusterDTC1

Let’s look at the parameters to understand what they are asking. The -name parameter is the name you want to give to the mapping and can be anything you want to call it. The -service parameter is the name of the SQL Server Service running the instance of SQL Server you want to map to. If it is a default instance it will be MSSQLServer and if it is a named instance it will be MSSQL$<InstanceName>. The -ClusterResourceName is the name of the MSDTC instance you created in the cluster that you want to map.

If you need to remove a mapping that uses the -tmMappingClear parameter and would look like this. Note that the -name parameter is the name of the mapping you chose in the step above.

msdtc -tmMappingClear -name MyMSDTC

Verify MSDTC

If you’re like me, you want to verify your configuration and make sure your SQL instance is using the MSDTC you expect. It’s a pretty simple two step process. First, we start a distributed transaction on our instance and second, we look in component services to see the transaction hit the coordinator. In my test environment I have the local MSDTC service disabled, an MSDTC in its own role, and an MSDTC in the same role as my AG. The MSDTC in its own role is configured as the Default Clustered Instance and the MSDTC in the AG role is manually mapped to the instance of SQL Server hosting my AG. Here is the code I will use.

USE App1AG_DB1;

BEGIN DISTRIBUTED TRANSACTION

COMMIT TRANSACTION;

ROLLBACK TRANSACTION;

I have the COMMIT and ROLLBACK statements in a comment block so I can see the transaction hit the coordinator and then I can choose to either commit or roll it back and see that in component services. Once I run the code that is not commented, here is what I see in Component Services.

I can see that I have 1 transaction that is active in my APP1AG MSDTC. That is the MSDTC I have in my AG role and is what I expected since I manually mapped it to override the Clustered Default which is the MYMSDTC. When I look at the Local MSDTC and MYMSDTC Transaction Statistics I see all zeros which confirms I am only using the APP1AG MSDTC. If I run the ROLLBACK statement, here is what I see in Component Services.

Best Practices Conclusion

The first thing is to understand how SQL Server enrolls with a transaction coordinator, so you know which one it is going to use. It’s the local MSDTC by default and we need to disable that service to use a clustered MSDTC and then define it in Component Services.

The second thing is to use a clustered MSDTC instead of a local one so we can make it highly available. This means we need to change the Default Coordinator and have some type of shared storage to host the MSDTC log file. For an example of shared storage using Storage Spaces Direct check out this article on Configure SQL Server Failover Cluster Instance on Azure Virtual Machines with MSDTC. It’s geared toward FCIs, but the Storage Spaces configuration is the same when deploying AGs.

The third thing is to consider scaling out and using 1 MSDTC per instance. The only way to scale this out with multiple AGs is to put each AG in its own instance. Remember that manually mapping is the only way to override the setting in the Default Clustered Instance and ensure that each instance hosting an AG is using a specific MSDTC.

Last is to make sure we put our MSDTC inside the role of our AG so when the AG fails over, the MSDTC will go with it. Otherwise we run the risk of introducing network latency with the MSDTC and SQL Instance on different nodes.

Updated Oct 14, 2019

Version 2.0

RyanAdams

ryanjadams

Microsoft

Joined February 27, 2019

View Profile

Core Infrastructure and Security Blog

Follow this blog board to get notified when there's new activity