This document helps troubleshoot and resolve replication performance-related issues.
Get information about “Topology Big Picture”:-
Before you dive into solving any issue, you need to fully understand the type of environment you have, as there might have been changes you are unaware of. An easy way to do that is to run ReplScripts/Replication Topology Script.sql at master · ReplTalk/ReplScripts, which gives output like the one below.
Review the output for details about publishers, publications, and subscriptions.
1. Check for replication issues using tracer tokens:-
Purpose of tracer tokens:-
Latency measurement:- Measure the time it takes for data to travel from the publisher → distributor → subscriber.
Connection validation:- Validate whether replication connections between components are functioning correctly.
Troubleshooting: Identify bottlenecks or delays in the replication process.
Using Measure latency & validate connections (Transactional) - SQL Server | Microsoft Learn, we can identify whether the issue is between the Publisher and the Distributor or between the Distributor and the Subscriber.
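For example, a tracer token can be posted and checked from the Publisher with T-SQL. The sketch below assumes a publication named MyPublication; the procedures shown (sp_posttracertoken, sp_helptracertokenhistory) run against the publication database at the Publisher.

```sql
-- Run at the Publisher, in the publication database.
-- 'MyPublication' is a placeholder for your publication name.
DECLARE @tokenId int;

EXEC sys.sp_posttracertoken
     @publication     = N'MyPublication',
     @tracer_token_id = @tokenId OUTPUT;

-- Give the token time to flow Publisher -> Distributor -> Subscriber, then check
-- the latency breakdown to see which leg is slow.
EXEC sys.sp_helptracertokenhistory
     @publication = N'MyPublication',
     @tracer_id   = @tokenId;
```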
The image below shows how data flows from the Publisher to the Subscriber through the Distributor after the initial snapshot has been applied.
As shown in the image:
- If the latency is between the Publisher -> Distributor, the issue might be in either:
  - LogReader-Reader Thread
  - LogReader-Writer Thread
- If the latency is between the Distributor -> Subscriber, the issue might be in either:
  - Distribution-Reader Thread
  - Distribution-Writer Thread
2. Performance Statistics Tools:-
Performance statistics were added to the MSlogreader_history table and the MSdistribution_history table in the distribution database in Microsoft SQL Server. You can use these statistics to see the recent performance history of the Replication Log Reader and Replication Distribution Agents.
Every five minutes, the performance statistics for the Log Reader and Distribution agents are recorded in the history tables. By default, only the data for the last 48 hours is retained. A cleanup process removes the data that is older than 48 hours. The default value can be changed by executing the sp_changedistributiondb stored procedure and specifying a new value for the history_retention parameter.
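For example, the retention window can be extended on the Distributor. This is a minimal sketch that assumes the default distribution database name and raises retention to 72 hours.

```sql
-- Run at the Distributor. 'distribution' is the default distribution database name;
-- adjust it and the retention value (in hours) for your environment.
EXEC sp_changedistributiondb
     @database = N'distribution',
     @property = N'history_retention',
     @value    = N'72';
```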
The following is a sample performance output from the history table for the Log Reader Agent:
- <stats state="1" work="9" idle="295"> <reader fetch="8" wait="0"/> <writer write="9" wait="0"/> <sincelaststats elapsedtime="304" work="9" cmds="52596" cmdspersec="5753.000000"> <reader fetch="8" wait="0"/> <writer write="9" wait="0"/> </sincelaststats> </stats>
There are three state events that can be recorded:
| State | Description |
| --- | --- |
| 1 | Normal events that describe both the reader and writer thread performance |
| 2 | Raised events that occur when an agent's reader thread waits longer than the agent's `-messageinterval` time. (By default, the time is 60 seconds.) If you notice State 2 events that are recorded for an agent, this indicates that the agent is taking a long time to write changes to the destination |
| 3 | Raised events that are generated only by the Log Reader Agent when the writer thread waits longer than the `-messageinterval` time. If you notice State 3 events that are recorded for the Log Reader Agent, this indicates that the agent is taking a long time to scan the replicated changes from the transaction log |
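To review these entries directly, you can query the history tables on the Distributor. This is a minimal sketch that assumes the default distribution database name.

```sql
-- Run at the Distributor: most recent performance-statistics entries for the agents.
USE distribution;

SELECT TOP (20) h.[time], h.delivery_latency, h.comments
FROM dbo.MSlogreader_history AS h
WHERE h.comments LIKE N'<stats%'
ORDER BY h.[time] DESC;

SELECT TOP (20) h.[time], h.delivery_latency, h.comments
FROM dbo.MSdistribution_history AS h
WHERE h.comments LIKE N'<stats%'
ORDER BY h.[time] DESC;
```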
2.1. Log Reader Agent Reader Thread:- The following performance statistics demonstrate a situation in which there is latency in the replication topology and in which the bottleneck is the Log Reader Agent reader thread. The Log Reader Agent reader thread scans the published database transaction log for commands to deliver to the distribution database (<Distribution server>..MSlogreader_history.Comments).
<stats state="1" work="301" idle="0" >
<reader fetch="278" wait="0"/>
<writer write="12" wait="288"/>
<sincelaststats elapsedtime="301" work="301" cmds="104500" cmdspersec="347.000000">
<reader fetch="278" wait="0"/>
<writer write="12" wait="288"/>
</sincelaststats>
</stats>
The sincelaststats writer wait statistic of 288 seconds appears high. This is the time that the writer thread waits for the reader thread to supply buffers to apply. The Log Reader Agent reader thread executes the sp_replcmds stored procedure. If you notice high writer wait times in the Log Reader Agent performance statistics, you should investigate the performance of the Log Reader Agent execution against the publication server and database (that is, the Log Reader Agent reader thread), and then investigate the execution time of the sp_replcmds stored procedure.
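One way to check this is to look at the session currently running sp_replcmds on the Publisher and see what it is waiting on. This is a sketch using standard DMVs, not a definitive diagnostic.

```sql
-- Run at the Publisher: find the Log Reader Agent reader-thread request (sp_replcmds)
-- and inspect its current wait type and wait time.
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.wait_time,      -- milliseconds
       r.cpu_time,
       t.text AS current_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE t.text LIKE N'%sp_replcmds%'
  AND r.session_id <> @@SPID;
```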
2.2. Distribution Agent Reader Thread:- The following performance statistics demonstrate a situation in which there is latency in the replication topology and in which the bottleneck is the Distribution Agent reader thread. This thread queries the distribution database (<Distribution server>..MSdistribution_history.Comments table) for commands to apply at the subscriber.
<stats state="1" work="14798" idle="2035">
<reader fetch="14798" wait="193"/>
<writer write="12373" wait="9888"/>
<sincelaststats elapsedtime="424" work="415" cmds="296900" cmdspersec="713.000000">
<reader fetch="415" wait="7"/>
<writer write="377" wait="212"/>
</sincelaststats>
</stats>
The sincelaststats writer wait time (212 seconds) appears high. This is the time that the writer thread waits for the reader thread to supply buffers that the writer thread can apply at the subscriber database. The Distribution Agent reader thread executes the sp_MSget_repl_commands stored procedure.
If you notice high writer wait times in the Distribution Agent performance statistics, you should investigate the performance of the Distribution Agent execution against the distribution server and database. In particular, you should investigate the execution time of the sp_MSget_repl_commands stored procedure.
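It can also help to see how many commands are still queued for the subscription. The sketch below uses sp_replmonitorsubscriptionpendingcmds on the Distributor; all server, database, and publication names are placeholders.

```sql
-- Run at the Distributor, against the distribution database: pending command count
-- and estimated delivery time for one subscription (names are placeholders).
EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
     @publisher         = N'PublisherServer',
     @publisher_db      = N'PublisherDb',
     @publication       = N'MyPublication',
     @subscriber        = N'SubscriberServer',
     @subscriber_db     = N'SubscriberDb',
     @subscription_type = 0;   -- 0 = push, 1 = pull
```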
2.3. Distribution Agent Writer Thread:- The following performance statistics demonstrate a situation in which there is latency in the replication topology and in which the bottleneck is the Distribution Agent writer thread. This thread applies the commands read from the distribution database (<Distribution server>..MSdistribution_history.Comments table) at the subscriber.
Notes:- The state is 2, and the output is somewhat different from State 1 statistics. State 2 data indicates that the reader thread had to wait longer than the Distribution Agent's configured -messageinterval value. By default, the -messageinterval value is 60 seconds.
<stats state="2" fetch="48" wait="384" cmds="1028" callstogetreplcmds="321">
<sincelaststats elapsedtime="312" fetch="47" wait="284" cmds="1028" cmdspersec="3.000000"/>
</stats>
If the -messageinterval value is increased, you may again receive State 1 statistics that resemble the following:
<stats state="1" work="1941" idle="0">
<reader fetch="717" wait="1225"/>
<writer write="1941" wait="134"/>
<sincelaststats elapsedtime="764" work="764" cmds="1170730" cmdspersec="1530.000000">
<reader fetch="258" wait="505"/>
<writer write="764" wait="50"/>
</sincelaststats>
</stats>
Note:- The sincelaststats reader wait time of 505 seconds is very high.
If you notice high reader wait times in the Distribution Agent performance statistics, you should investigate the performance of the Distribution Agent execution against the subscriber server and database. Use the Profiler trace tool to investigate the performance of the execution of the replication stored procedures (see the example query after this list). Usually the stored procedures are named as follows:
- sp_MSupd_<ownertablename>
- sp_MSins_<ownertablename>
- sp_MSdel_<ownertablename>
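As an alternative to a trace, cached execution statistics can give a quick view of which replication apply procedures are slow at the Subscriber. This is a sketch based on sys.dm_exec_procedure_stats; it only reflects procedures currently in the plan cache.

```sql
-- Run at the Subscriber, in the subscription database: replication apply procedures
-- (sp_MSins_/sp_MSupd_/sp_MSdel_) with the highest average elapsed time.
SELECT TOP (20)
       OBJECT_NAME(ps.object_id, ps.database_id)  AS proc_name,
       ps.execution_count,
       ps.total_elapsed_time / ps.execution_count AS avg_elapsed_microseconds
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID()
  AND OBJECT_NAME(ps.object_id, ps.database_id) LIKE N'sp_MS%'
ORDER BY avg_elapsed_microseconds DESC;
```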
Additionally, to determine whether the bottleneck is hardware-based or system-based, use the performance monitor to monitor system performance. For example, use the performance monitor to monitor the Physical Disk counters.
For more details about these replication performance tools, refer to Statistics for Log Reader and Distribution agents - SQL Server | Microsoft Learn.
In some scenarios, the required information cannot be obtained from the history tables mentioned above. In such scenarios, generate replication verbose logs as described here: https://learn.microsoft.com/en-us/sql/relational-databases/replication/troubleshoot-tran-repl-errors?view=sql-server-ver17#enable-verbose-logging-on-any-agent
3. Other Issues and Resolutions:-
Issue 1:- In an environment with many publications, some publications are moving forward, some are not, and some are moving slowly.
Observations 1:-
- From the distributor verbose log, we observe the following error message: “2025-03-05 12:54:20.538 The concurrent snapshot for publication '<PublicationName>' is not available because it has not been fully generated or the Log Reader Agent is not running to activate it. If generation of the concurrent snapshot was interrupted”
- This message comes from the procedure sys.sp_MSsubscription_status. It happens because the status of the subscription is 3 ("initiated") and the sync_type is automatic. The Log Reader must read from the log when the Snapshot Agent started and ended; only after that will it update the MSsubscription table.
Resolution 1:- The problem was a large volume of data for the Log Reader to process. It must be allowed to move forward and read through the log to completion after waiting some time.
Issue 2:- Transactional replication is used to replicate data to a client's Azure SQL VM in their tenant/subscription. The data push is slow to complete during peak times.
Observations 2:- We cannot find an issue with our Distributor VM (this VM), and the client's VM doesn't have an apparent issue.
Resolution 2:- The root cause is two publications that push large amounts of data late at night. One of these tables, on the Subscriber side, is very large: most of its space (more than 50%) is taken up by indexes, and it has roughly 13+ billion rows. Pushing data to this table and other large tables causes excessive reads and multi-threading, resulting in latency that prevents the Distributor from applying the transactions in a timely fashion. The following Subscriber-side actions helped:
- Removing duplicate or unused indexes (see the example query after this list)
- Archiving data
- Rebuilding indexes
- Creating missing indexes
- Updating statistics
- Partitioning the table
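For the first item, one way to spot candidate unused nonclustered indexes is to compare read and write counts from the index usage DMV. This is a sketch only; usage statistics reset on restart, so confirm findings over a representative period before dropping anything.

```sql
-- Run at the Subscriber, in the subscription database: nonclustered indexes with
-- few or no reads but ongoing write cost are candidates for removal.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       ISNULL(ius.user_seeks, 0) + ISNULL(ius.user_scans, 0) + ISNULL(ius.user_lookups, 0) AS total_reads,
       ISNULL(ius.user_updates, 0) AS total_writes
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS ius
       ON ius.object_id   = i.object_id
      AND ius.index_id    = i.index_id
      AND ius.database_id = DB_ID()
WHERE i.type_desc = N'NONCLUSTERED'
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY total_reads ASC, total_writes DESC;
```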
Issue 3:- Lag from the Pub/Dist to the Subscribers.
Observations 3:-
--> In the SQL Server error log, there is a SQL Server service restart. The distribution database took a long time to recover; after that, the jobs started to hit query timeouts.
--> During troubleshooting, we noticed blocking issues and confirmed that the head blocker was the Log Reader Agent session. The Log Reader Agent session was not stuck or hung; it was handling a huge transaction, which took that long and caused the blocking.
Resolutions 3:- The agent profile properties below were modified (see the example after this list) and UPDATE STATISTICS was executed, which helped improve Log Reader Agent performance.
- MaxCmdsInTran = 10000
- ReadBatchSize = 5000
- LogScanThreshold = 750000
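Profile parameters like these can be changed on the Distributor in a custom agent profile. The sketch below assumes a custom Log Reader Agent profile already exists; the profile_id shown is a placeholder that you would look up in msdb.dbo.MSagent_profiles.

```sql
-- Run at the Distributor: list agent profiles, then change a parameter value
-- in a custom profile (do not modify the built-in default profiles).
SELECT profile_id, profile_name, agent_type
FROM msdb.dbo.MSagent_profiles;

EXEC sp_change_agent_parameter
     @profile_id      = 2,                 -- placeholder: id of your custom Log Reader profile
     @parameter_name  = N'ReadBatchSize',
     @parameter_value = N'5000';
```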
Issue 4:- The Merge Agent job is continuously running, and we see the below message in Replication Monitor: "Waiting 60 Second(s) Before Polling For Further Changes."
Resolutions 4:- We reproduced the same issue in the internal lab and confirmed that the -Continuous parameter causes this message. (If this parameter is added, the agent never stops and continuously checks for changes.)
-Continuous:- Specifies whether the agent attempts to poll replicated transactions continually. If specified, the agent polls replicated transactions from the source at polling intervals, even if there are no transactions pending.
--> After removing this parameter, we did not receive the message.
--> Additionally, use the -QueryTimeout parameter so that we can keep a tab on long-running queries.
-QueryTimeOut query_time_out_seconds
Is the number of seconds before the query times out. The default is 300 seconds. The Merge Agent also uses the value of QueryTimeout to determine how long to wait for generation of a partitioned snapshot when this value is greater than 1800.
Issue 5:- Data replication is taking a huge amount of time.
Observations 5:- We observed the following:
- The VM has been constantly at 100% of its overall disk bandwidth during the 24 hours prior to data collection.
- Data limit: This VM has a data throughput limit of 256 MBps. The server has multiple P30 disks, each capable of reaching 200 MBps of throughput. Regardless of whether stripes are used, the disks are limited to the VM's maximum capacity.
- Lazy Writes Per Second: Ideally, the value for Lazy Writes Per Second should be close to zero.
- Page Life Expectancy (PLE): A higher PLE indicates better performance, as it means pages stay longer in the buffer pool (memory cache). This reduces the need to read data from the hard drive. Currently, the PLE value is very low.
- Head Blockers: There is a significant amount of blocking, mostly involving the cleanup job. The Log Reader Agent was the head blocker in two instances, with an average duration of under 20 seconds.
- Log Reader Agent: The log reader agent appears to be blocking some index operation jobs. It has an inactive status with open transactions due to continuous mode.
- Large Batches of Transactions: When the Log Reader Agent processes a large number of commands or transactions, it can lead to delays in the replication process, potentially causing blocking as other processes wait for the agent to complete its work. The Log Reader Agent can cause blocking issues in SQL Server transactional replication primarily due to large batches of replicated transactions or a high percentage of non-replicated transactions within the transaction log, leading to increased latency and resource contention.
- High Percentage of Non-Replicated Transactions: If a significant portion of the transaction log contains transactions that are not marked for replication, the Log Reader Agent needs to scan over these transactions, resulting in increased latency and potentially causing blocking.
- Resource Constraints: If the server has insufficient resources (CPU, memory, disk I/O), the Log Reader Agent might be delayed, leading to blocking issues.
As previously discussed, the last two items above are the most likely causes of the blocking.
Resolution 5:-
- ETL operations to or from non-replicated databases:- As previously discussed, my top recommendation is to reach out to the Dev team to see if they can move their ETL operation to another database without replication. If a significant portion of the transaction log contains transactions that are not marked for replication, the Log Reader Agent needs to scan over these transactions, resulting in increased latency and potentially causing blocking.
- Adjust ReadBatchSize:- Another option to reduce blocking might be to adjust the ReadBatchSize parameter. Although we have done this in the past, it is worth experimenting with different values until we find the optimal setting. Please note that a higher value might not always improve performance, especially for workloads with large transactions.
- Upgrade VM SKU:- This server is constantly reaching its capacity. I recommend upgrading the VM SKU to better meet your workload capacity.
- Blocking Concerns:- If blocking is the main concern, another option is to create a separate job that checks for blocking. If the log reader agent is identified as the head blocker, the job can restart the log reader agent as shown below.
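A minimal sketch of such a job step is shown below. It assumes the Log Reader Agent session can be identified by its program name and that the agent job name is known; both values are placeholders and must be adapted to your environment before use.

```sql
-- Sketch: if the Log Reader Agent session is currently a head blocker, restart its job.
-- The job name and the program_name filter are assumptions; adjust them to your agent.
DECLARE @logreader_job sysname = N'REPL-LogReader-MyPublisherDb';   -- hypothetical job name

IF EXISTS (SELECT 1
           FROM sys.dm_exec_sessions AS s
           JOIN sys.dm_exec_requests AS r
             ON r.blocking_session_id = s.session_id
           WHERE s.program_name LIKE N'%LogReader%')                 -- assumed filter
BEGIN
    EXEC msdb.dbo.sp_stop_job  @job_name = @logreader_job;
    EXEC msdb.dbo.sp_start_job @job_name = @logreader_job;
END;
```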
Issue 6:- Replication Performance Issues
Resolution 6:- We found that the SQL Server and disk configuration did not follow best practices. After applying the disk and SQL Server configuration best practices, the issue was resolved. For reference: Checklist: Best practices & guidelines - SQL Server on Azure VMs | Microsoft Learn
Issue 7:- Log reader agent delay
Observations 7:-
- After restarting log reader agent, it starts working again.
- We switched -ReadBatchSize between 500 and 5000 and compared the performance; we now keep 500 for log processing.
- From the Log Reader Agent history, the transaction log is being processed but there is a lot of pending log, and processing seems slow. The statistics below were taken from the Log Reader Agent history.
Basically, on the publisher server, the logread process has a reader thread that connects to the publisher database to run sp_replcmds. On the distributor server, the logread process has a writer thread that connects to the distribution database to write commands using distribution.dbo.sp_MSadd_replcmds. The data read from the publisher database is then written to the distribution database. In the output below, 'reader fetch' is 2940s with a 1s wait, which means the reader thread spent 2940 seconds fetching data and waited 1 second in the last measurement period, while the writer thread spent 99 seconds writing and 2439 seconds waiting.
<stats state="1" work="2539" idle="0" ><reader fetch="2940" wait="1"/><writer write="99" wait="2439"/><sincelaststats elapsedtime="526" work="526" cmds="407220" cmdspersec="772.000000"><reader fetch="526" wait="0"/><writer write="13" wait="513"/></sincelaststats><message>Normal events that describe both the reader and writer thread performance.</message></stats>
Here the writer thread's long wait is actually time spent waiting for data buffers to be read in, so the slowness is in the data-reading stage.
- Based on VM disk performance checks, the LUNs that hold the log file do not show latency.
- In the Log Reader Agent history, we noticed many messages like 'Approximately xxx log records have been scanned in pass xx, 0 of which were marked for replication.' These messages mean the scanned log records are not for replication, so they are skipped. If there are too many unrelated log records, the Log Reader Agent spends a lot of time scanning them, which slows down replication transaction sync.
- There was an authentication issue on the Log Reader Agent yesterday that blocked log syncing for a long time, so after resuming today it accumulated a large number of transactions. The large pending set also needs time to be consumed.
Resolution 7:- There is no disk latency on the log drive; the Log Reader Agent slowness is mainly due to too many non-replicated transactions and a large number of accumulated batches. Best practices to follow in this scenario are:
- Shrink log file before configuring replication
- Configure Replication in off-peak hours
4. Enhance Transaction Replication performance:-
1. Design Level:-
a. Database Design:-
- Minimize transaction size in your application design.
- By default, transactional replication propagates changes according to transaction boundaries. If transactions are smaller, the Distribution agent is less likely to resend a transaction due to network issues. If the agent is required to resend a transaction, the amount of data sent is smaller.
b. Distributor Configuration:-
- Configure the Distributor on a dedicated server.
- You can reduce processing overhead on the Publisher by configuring a remote Distributor. For more information, see Configure Distribution.
- Size the distribution database appropriately.
- Test replication with a typical load for your system to determine how much space is required to store commands. Ensure the database is large enough to store commands without having to auto-grow frequently. For more information about changing the size of a database, see ALTER DATABASE (Transact-SQL).
c. Publication Design:-
- Replicate stored procedure execution when making batch updates to published tables.
- If you have batch updates that occasionally affect a large number of rows at the Subscriber, you should consider updating the published table using a stored procedure and publishing the execution of that stored procedure. Instead of sending an update or delete for every row affected, the Distribution Agent executes the same procedure at the Subscriber with the same parameter values. For more information, see Publishing Stored Procedure Execution in Transactional Replication; an example article definition is shown after this list.
- Spread articles across multiple publications.
- If you cannot use the -SubscriptionStreams parameter, consider creating multiple publications. Spreading articles across these publications allows replication to apply changes in parallel to Subscribers.
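A minimal sketch of publishing procedure execution with sp_addarticle follows; the publication and procedure names are placeholders, and the article type uses the documented 'serializable proc exec' / 'proc exec' option.

```sql
-- Publish the execution of a batch-update procedure instead of the row-by-row changes
-- it produces (publication and procedure names are placeholders).
EXEC sp_addarticle
     @publication   = N'MyPublication',
     @article       = N'usp_BatchUpdatePrices',
     @source_owner  = N'dbo',
     @source_object = N'usp_BatchUpdatePrices',
     @type          = N'serializable proc exec';   -- or N'proc exec'
```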
d. Subscription Considerations:-
- Use independent agents rather than shared agents if you have multiple publications on the same Publisher (this is the default for the New Publication Wizard).
- Run agents continuously instead of on frequent schedules.
- Setting the agents to run continuously rather than creating frequent schedules (such as every minute) improves replication performance, because the agent does not have to start and stop. When you set the Distribution Agent to run continuously, changes are propagated with low latency to the other servers that are connected in the topology. For more information, see Specify Synchronization Schedules (SQL Server Management Studio).
2. Agent Properties Level
a. Log Reader Agent:-
ReadBatchSize:-
- Increase the value of the -ReadBatchSize parameter for the Log Reader Agent (see the example after this description).
Description:-
The Log Reader Agent and Distribution Agent support batch sizes for transaction read and commit operations. Batch sizes default to 500 transactions. The Log Reader Agent reads the specific number of transactions from the log, whether or not they are marked for replication. When a large number of transactions are written to a publication database, but only a small subset of those are marked for replication, you should use the -ReadBatchSize parameter to increase the read batch size of the Log Reader Agent. This parameter does not apply to Oracle Publishers.
- Workloads of smaller transactions (fewer than 500 commands) see an increase in how many commands are processed per second when ReadBatchSize is increased up to 5000.
- For larger workloads (transactions with 500 to 1000 commands), increasing ReadBatchSize has little performance improvement. Increasing ReadBatchSize results in a greater number of transactions written to the distribution database in one roundtrip. This increases the time transactions and commands are visible to the Distribution Agent and introduces latency to the replication process.
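Besides using an agent profile (shown earlier), the parameter can be appended directly to the Log Reader Agent job step. The sketch below is an assumption-heavy illustration: the job name is a placeholder and the "Run agent." step is typically step 2, but verify both in your environment first.

```sql
-- Run where the Log Reader Agent job lives (normally the Distributor).
-- Append -ReadBatchSize to the existing agent command (job name is a placeholder).
DECLARE @cmd nvarchar(max);

SELECT @cmd = s.command
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
WHERE j.name = N'REPL-LogReader-MyPublisherDb'    -- hypothetical job name
  AND s.subsystem = N'LogReader';

EXEC msdb.dbo.sp_update_jobstep
     @job_name = N'REPL-LogReader-MyPublisherDb',
     @step_id  = 2,                               -- verify: usually the "Run agent." step
     @command  = @cmd + N' -ReadBatchSize 5000';
```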
PollingInterval:-
- Decrease the value of the -PollingInterval parameter for the Log Reader Agent.
Description:-
The -PollingInterval parameter specifies how often the transaction log of a published database is queried for transactions to replicate. The default is 5 seconds. If you decrease this value, the log is polled more frequently, which can result in lower latency for the delivery of transactions from the publication database to the distribution database. However, you should balance the need for lower latency against the increased load on the server from polling more frequently.
MaxCmdsInTran:-
- To resolve accidental, one-time bottlenecks, use the -MaxCmdsInTran parameter for the Log Reader Agent.
Description:-
The -MaxCmdsInTran parameter specifies the maximum number of statements grouped into a transaction as the Log Reader writes commands to the distribution database. Using this parameter allows the Log Reader Agent and Distribution Agent to divide large transactions (consisting of many commands) at the Publisher into several smaller transactions when applying commands at the Subscriber. Specifying this parameter can reduce contention at the Distributor and reduce latency between the Publisher and Subscriber. Because the original transaction is applied in smaller units, the Subscriber can access rows of a large logical Publisher transaction prior to the end of the original transaction, breaking strict transactional atomicity. The default is 0, which preserves the transaction boundaries of the Publisher. This parameter does not apply to Oracle Publishers.
Note:-
- MaxCmdsInTran was not designed to be always turned on. It exists to work around cases where someone accidentally performed a large number of DML operations in a single transaction (causing a delay in the distribution of commands until the entire transaction is in the distribution database, locks being held, and so on). If you routinely fall into this situation, review your applications and find ways to reduce the transaction size.
- MaxCmdsInTran is not supported if the given publication database is enabled for both Change Data Capture and replication. Using MaxCmdsInTran in this configuration may lead to data loss in CDC change tables. It may also cause PK errors if the MaxCmdsInTran parameter is added and removed while replicating a large transaction.
b. Distribution Agent:-
SubscriptionStreams:-
- Increase the -SubscriptionStreams parameter for the Distribution Agent.
Description:- The -SubscriptionStreams parameter can greatly improve aggregate replication throughput. It allows multiple connections to a Subscriber to apply batches of changes in parallel, while maintaining many of the transactional characteristics present when using a single thread. If one of the connections fails to execute or commit, all connections will abort the current batch, and the agent will use a single stream to retry the failed batches. Before this retry phase completes, there can be temporary transactional inconsistencies at the Subscriber. After the failed batches are successfully committed, the Subscriber is brought back to a state of transactional consistency.
A value for this agent parameter can be specified using the @subscriptionstreams parameter of sp_addsubscription (Transact-SQL), as in the example below.
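A minimal sketch of creating a push subscription with four streams; the server, database, and publication names are placeholders.

```sql
-- Run at the Publisher, in the publication database: push subscription that applies
-- changes over four parallel streams (names are placeholders).
EXEC sp_addsubscription
     @publication         = N'MyPublication',
     @subscriber          = N'SubscriberServer',
     @destination_db      = N'SubscriberDb',
     @subscription_type   = N'Push',
     @subscriptionstreams = 4;
```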
Blocking Monitor Thread:-
- Distribution Agent maintains a blocking monitor thread that detects blocking between sessions. If the blocking monitor thread detects blocking between the sessions, Distribution Agent switches to use one session to reapply the current batch of commands that could not be applied previously.
- The blocking monitor thread can detect blocking between Distribution Agent sessions. However, the blocking monitor thread cannot detect blocking in the following situations:
  - One of the sessions where blocking occurs is not a Distribution Agent session.
  - A session deadlock freezes the activities of the Distribution Agent. In this situation, the Distribution Agent coordinates all the sessions to commit together as soon as their commands are executed. A deadlock among the sessions occurs if the following conditions are true:
    - Blocking occurs between the Distribution Agent sessions and a session that is not a Distribution Agent session.
    - The Distribution Agent is waiting for all the sessions to finish executing their commands before it coordinates all the sessions to commit together.
CommitBatchSize:-
- Increase the value of the -CommitBatchSize parameter for the Distribution Agent.
Description:-
Committing a set of transactions has a fixed overhead; by committing a larger number of transactions less frequently, the overhead is spread across a larger volume of data. Increasing CommitBatchSize (up to 200) can improve performance as more transactions are committed to the subscriber. However, the benefit of increasing this parameter drops off as the cost of applying changes is gated by other factors, such as the maximum I/O of the disk that contains the log. Additionally, there is a trade-off to be considered: any failure that causes the Distribution Agent to start over must roll back and reapply a larger number of transactions. For unreliable networks, a lower value can result in fewer failures and a smaller number of transactions to roll back and reapply if a failure occurs.
Refer to the link below for more details:
https://learn.microsoft.com/en-us/sql/relational-databases/replication/administration/enhance-transactional-replication-performance?view=sql-server-ver15