In an AlwaysOn environment, if you find that memory usage is slowly increasing or the server is running out of memory, you can use the following steps to check whether it is caused by slow synchronization.
First, check the errorlog or run DBCC MEMORYSTATUS to dump the memory usage, and see whether the allocation is concentrated in OBJECTSTORE_SERVICE_BROKER, for example:
OBJECTSTORE_SERVICE_BROKER (node 0) KB
---------------------------------------- ----------
VM Reserved 0
VM Committed 0
Locked Pages Allocated 0
SM Reserved 0
SM Committed 0
Pages Allocated 55028152 --near 52GB
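If you prefer a DMV over reading the DBCC MEMORYSTATUS output, the same totals are exposed in sys.dm_os_memory_clerks. The query below is just my own sketch to list the largest memory clerks so you can confirm which store dominates:
-- List the top memory clerks by pages allocated (pages_kb is reported in KB)
SELECT TOP (10)
       [type],
       [name],
       SUM(pages_kb) / 1024 AS pages_mb
FROM sys.dm_os_memory_clerks
GROUP BY [type], [name]
ORDER BY SUM(pages_kb) DESC;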
Then we can use a DMV to check the detailed usage under this object store and identify whether the memory is held by a component whose name starts with 'HADR':
select * from sys.dm_os_memory_clerks where type='OBJECTSTORE_SERVICE_BROKER'
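To narrow it down further, you can group this clerk by its name column; something like the following sketch shows which named component inside the object store is holding the memory:
-- Break the OBJECTSTORE_SERVICE_BROKER clerk down by component name
SELECT [name],
       memory_node_id,
       SUM(pages_kb) / 1024 AS pages_mb
FROM sys.dm_os_memory_clerks
WHERE [type] = 'OBJECTSTORE_SERVICE_BROKER'
GROUP BY [name], memory_node_id
ORDER BY SUM(pages_kb) DESC;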
In my last case, the 52 GB of memory was all used by the 'HADR Log Block Msg Pool', and the size of this pool is related to the log send queue. Until a HADR log block is successfully sent to the secondary replica, its handle is not released on the primary replica, and neither is the memory allocated for it. When the primary replica cannot send HADR log block messages to the secondary but keeps receiving control messages from it, the primary thinks the connection to the secondary is still alive and continues generating new HADR log block messages. This ends up consuming more and more memory.
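To see whether the primary still considers the secondary connected (which is what keeps it generating new log block messages), you can check the replica states on the primary. This is my own sketch, not something from the original troubleshooting steps:
-- Run on the primary: connection and synchronization health per replica
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_description
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = ars.replica_id;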
So I used the AlwaysOn dashboard to check the log send queue size for the secondary. It was indeed very large, but I had not noticed it earlier because the dashboard did not give me any warning or error message.
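If you do not want to rely on the dashboard, the same number is available from a DMV. A query along these lines (run on the primary) returns the log send queue per database in KB:
-- log_send_queue_size and redo_queue_size are reported in KB
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_queue_size,
       drs.log_send_rate,
       drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0
ORDER BY drs.log_send_queue_size DESC;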
For a large log send queue, we can refer to the blogs below and use the AG latency tool for further troubleshooting:
In my case it was straightforward: I checked the errorlog on the secondary directly and found many occurrences of the following message:
2021-11-20 05:50:12.44 spid1s There have been 1284352 misaligned log IOs which required falling back to synchronous IO. The current IO is on file E:\TransactionLog\WSS_ContentLog.ldf.
2021-11-20 05:50:15.74 spid338s There have been 1284608 misaligned log IOs which required falling back to synchronous IO. The current IO is on file E:\TransactionLog\WSS_Content2020_Log.ldf.
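If you want to count these occurrences without opening the log file by hand, the (undocumented but widely used) xp_readerrorlog procedure can filter the current errorlog for you; for example:
-- Search the current SQL Server errorlog (log 0, type 1 = SQL Server log) for the warning
EXEC master.dbo.xp_readerrorlog 0, 1, N'misaligned log IOs';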
This is a known issue, and there are already a number of docs and KB articles that explain it:
After I enabled trace flag 1800, both the log send queue size and the memory issue were resolved.
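For reference, the trace flag can be turned on at runtime and then made permanent as a startup parameter. The docs and KBs above describe which replica needs it (my understanding is the one whose log disk reports 512-byte physical sectors); the commands themselves are just a sketch:
-- Enable trace flag 1800 globally for the running instance
DBCC TRACEON (1800, -1);
-- Verify that it is active
DBCC TRACESTATUS (1800);
-- To keep it across restarts, add -T1800 as a SQL Server startup parameter
-- (SQL Server Configuration Manager > SQL Server service > Startup Parameters)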