Troubleshoot and optimize Hyperscale named replicas performance

Microsoft

Jan 16, 2025

In Azure SQL Database Hyperscale service tier, you can add multiple named replicas to scale-out read workloads. Each named replica can have different service level objectives (SLOs) or compute sizes to meet specific workload requirements.

Recently, we identified a customer scenario where one of their named replicas was set up with a much lower compute size compared to the primary replica. During periods of high write volumes on the primary and sufficiently intense read workloads on the undersized named replica, we noticed a pattern. The limited resources on this named replica caused log records to be applied slowly, leading to a redo queue buildup. To clear the redo queue, the transaction log rate governance temporarily reduced the primary log generation rate. Additionally, the redo thread on this named replica was getting blocked due to lock contention from the read workload.

There could be multiple reasons for log rate reduction, which you can explore in this article: Performance diagnostics in Hyperscale - Azure SQL Database.

Determining log rate reduction due to readable secondary replica

Upon monitoring the waits, high RBIO_RG_REPLICA wait could be observed on the primary replica, indicating that a secondary replica is lagging in applying log records. This can occur with both named and high availability secondary replicas but is more common with undersized named replicas.

Steps to check log rate reduction:

Monitor Waits: Start by monitoring waits on the primary replica. For example, review the wait_type column in DMVs sys.dm_os_wait_stats and sys.dm_exec_requests. Frequent RBIO_RG_REPLICA waits suggest checking the log generation rate and investigating resource contention.
Check current log rate: Use the DMF sys.dm_hs_database_log_rate() to determine if the log rate is being reduced. This DMF provides details on the current max log rate, the component causing the reduction, the current catch-up rate, and the catch-up distance.
Example: The screenshot below shows the sample output of the query SELECT * FROM sys.dm_hs_database_log_rate(NULL) executed on a primary replica when one of the secondary replicas is causing a reduction in the primary log rate.
Map Replica ID: If the output of above query shows reduced log rate and the role_desc as 'Replica', you can determine the problematic replica by executing the query SELECT DATABASEPROPERTYEX(DB_NAME(),'REPLICAID') against your named replicas.
Note: DATABASEPROPERTYEX(DB_NAME(),'REPLICAID') will provide the replica ID of the replica in which you execute the query.

Steps to avoid log rate reduction due to an undersized named replica:

Once you have identified that the log rate is being reduced due to a named replica, you can check the SLO of the replica by querying sys.database_service_objectives in the context of a named replica and consider scaling it up to allocate more resources to this replica to aid in reducing the catch-up distance.

In some situations, customers create named replicas for future reporting requirements or for unpredictable workload patterns. In those scenarios, you may explore/configure serverless compute for named replicas so that they can scale automatically to meet the workload requirements. (Yes, you can configure the named replica as serverless compute even if your primary replica is provisioned compute.)

Note: The workload owner must decide if high data latency on the named replica and reduction of log rate on the primary replica is acceptable. For some workloads, immediate data availability may necessitate upgrading to a higher SLO. For others, such as reporting, higher latency might be acceptable. The decision hinges on whether to reduce costs at the expense of higher latency or to minimize latency at a higher cost, especially if CPU availability on the named replica is a bottleneck in applying log records.

References:

Named Replicas:

Isolated Access for named replica

Enhancing Resilience in Azure SQLDB Hyperscale Named Replicas

Zone redundancy support for named replica

Maintenance window for named replica

Updated Jan 16, 2025

Version 1.0