Today, I worked on a service request that our customer faced a performance issue in their business critical service tier database. Their main suspect was that the syncronization with the secondary replica has a delay.
During their troubleshooting process they found that they got many request with the wait type is HADR_SYNC_COMMIT. This wait type indicates the time that the primary replica spends waiting for the secondary replica to harden the log records (LSN). This wait type only occurs on the primary replica and only inside a synchronous replication that we could have for every internal replica when we created a business critical or premium service tier database.
As soon as a transaction is received in the primary replica and it sends to the secondary replica for hardering, the wait type HADR_SYNC_COMMIT wait time starts counting the time until receiving the confirmation for the secondary's transaction log was completed.
Following, I would like to share with you how is possible to reproduce this and how is posible to know the latency.
I created this table including the option OPTIMIZE_FOR_SEQUENTIAL_KEY=ON for improvements when we are adding incremental values.
CREATE TABLE [MSxyzTest].[_x_y_z_MS_HighLogIO](
[DataInsert] [nvarchar](max) NULL,
[ID] [int] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK__x_y_z_MS_HighLogIO] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
In order to reproduce a high number of transactions, I developed the following store procedure:
CREATE OR ALTER Procedure RunATest
as
DECLARE @TOP AS INT
DECLARE @TSQL AS VARCHAR(4000)
SET @TOP = CEILING(RAND()*(1-180)+180)
SET @TSQL = 'INSERT INTO [MSxyzTest].[_x_y_z_MS_HighLogIO] (dataInsert)'
SET @TSQL = @TSQL + 'values(replicate(CONVERT(NVARCHAR(4000),'
SET @TSQL = @TSQL + CHAR(39) + 'M' + CHAR(39)+ '),4000))'
exec (@tsql)
Finally, using the tool oStress, we are going to run 798 concurrent process (my business critical has 8 vCores and 800 workers limit) the store procedure 10000 times.
"C:\Program Files\Microsoft Corporation\RMLUtils\ostress.exe" -dddbb -Q"EXEC dbo.RunaTest" -Sservername.database.windows.net -Uusername -n798 -r10000 -q -l60 -Ppassword
Running this process we could see the first HADR_SYNC_COMMIT using the following query:
select req.wait_resource as Resource,
req.status, tasks.session_id, task_state,
wait_type, wait_time, req.last_wait_type, cpu_time, dop,
req.command,
blocking_session_id,
substring
(REPLACE
(REPLACE
(SUBSTRING
(ST.text
, (req.statement_start_offset/2) + 1
, (
(CASE statement_end_offset
WHEN -1
THEN DATALENGTH(ST.text)
ELSE req.statement_end_offset
END
- req.statement_start_offset)/2) + 1)
, CHAR(10), ' '), CHAR(13), ' '), 1, 512) AS statement_text,
sched.status,
* from sys.dm_exec_requests req
join sys.dm_os_workers work on req.task_address = work.task_address
join sys.dm_os_tasks tasks on req.session_id = tasks.session_id
join sys.dm_os_schedulers sched on sched.scheduler_id = tasks.scheduler_id
CROSS APPLY sys.dm_exec_sql_text(req.sql_handle) as ST
where req.status <> 'background' and req.session_id<> @@spid
AND req.wait_type LIKE 'HADR_SYNC%'
order by wait_resource,req.session_id, req.status
For every execution we could see, in the following image, that gave us, as an example, the latency per request.
Also, using the following query:
select * from sys.dm_os_wait_stats where wait_type = 'HADR_SYNC_COMMIT'
The Waiting Task Count column give us the number of single transactions done until now, if we divide by wait_time_ms migh give us the milliseconds in average that took for syncronization for secondary. In this case, 676004/372743 = 1.8 milliseconds.
Finally, our customer asked if could be possible to have information about CPU, IO and workers used. We replied that using the query SELECT * FROM SYS.DM_DB_RESOURCE_STATS you could have the following information:
Enjoy!