Recently, I worked on an interesting customer case involving intermittent query failures while fetching reporting data from a read replica.
The customer was running their database on the Azure SQL Database Hyperscale service tier, utilizing the read scale-out replica to offload reporting traffic from the primary compute node.
However, while loading reports, the application occasionally took longer than usual and then failed with the following error:
[DataSource.Error] ERROR [HY000] [Microsoft][ODBC Driver 18 for SQL Server] Unspecified error occurred on SQL Server. Connection may have been terminated by the server.
ERROR [HY000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server] The service has encountered an error processing your request. Please try again.
Error code 3947.
In this post, I’ll walk through how we analyzed the issue, identified the root cause, and the mitigation strategies that helped the customer.
Understanding Error 3947
The key indicator in the error message was Error 3947.
3947 – The transaction was aborted because the secondary compute failed to catch up redo. Retry the transaction.
This error typically occurs when the secondary compute node (read replica) cannot keep up with redo operations being replayed from the primary replica.
When this happens, the system may terminate long-running queries on the replica to prevent the replica from falling too far behind the primary.
Step 1 – Connect to the Read Replica
To begin troubleshooting, connect to the read replica using SSMS or your client tool with the following connection parameter:
ApplicationIntent=ReadOnly
This ensures the session connects to the read scale-out replica instead of the primary compute node.
Step 2 – Check for Active Blocking Sessions
Once connected to the read replica, the first step is to check whether queries are blocking each other.
The following query helps identify active requests and blocking chains.
SELECT
r.session_id,
r.status,
r.command,
r.blocking_session_id,
r.wait_type,
r.wait_time,
r.wait_resource,
r.cpu_time,
r.total_elapsed_time,
DB_NAME(r.database_id) AS database_name,
SUBSTRING(t.text,
(r.statement_start_offset/2)+1,
((CASE r.statement_end_offset
WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset END
- r.statement_start_offset)/2) + 1) AS running_statement
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id <> @@SPID
ORDER BY r.total_elapsed_time DESC;
Columns to observe:
- blocking_session_id – indicates if a query is being blocked
- wait_type – shows what the query is waiting for
- total_elapsed_time – identifies long-running queries
Step 3 – Identify Compile Locks and Schema Locks
While investigating the issue, we observed multiple compile locks and schema locks on several table objects.
Even though the database is read-only, queries still need access to metadata, and this can result in schema-related locks.
Use the following query to identify schema locks on objects.
SELECT
tl.request_session_id,
tl.resource_type,
tl.resource_associated_entity_id AS object_id,
OBJECT_NAME(tl.resource_associated_entity_id) AS object_name,
tl.request_mode,
tl.request_status
FROM sys.dm_tran_locks tl
WHERE tl.resource_type = 'OBJECT'
AND tl.request_mode LIKE '%SCH%'
ORDER BY tl.request_session_id;
Typical schema locks include:
- SCH-S – Schema Stability
- SCH-M – Schema Modification
These locks indicate metadata access or schema changes.
Step 4 – Identify Waiting Tasks
Next, check whether sessions are waiting due to schema lock contention.
SELECT
wt.session_id,
wt.wait_type,
wt.wait_duration_ms,
wt.blocking_session_id,
wt.resource_description
FROM sys.dm_os_waiting_tasks wt
WHERE wt.session_id > 50
ORDER BY wt.wait_duration_ms DESC;
Common wait types that may appear include:
- LCK_M_SCH_S
- LCK_M_SCH_M
These wait types usually indicate schema lock contention between queries and redo operations.
Why Blocking Happens on Read Replicas
In normal scenarios, read-only databases eliminate most user-generated blocking since users cannot modify data.
However, schema changes performed on the primary replica are still replayed on the read-only replica through the redo process.
Examples include:
- ALTER TABLE
- Index rebuilds
- Statistics updates
During this replay process, the redo thread temporarily acquires schema locks to maintain metadata consistency.
If a read query accesses the same object at the same time, the query may need to wait until the schema operation completes.
When Long Running Queries Block the Redo Process
Queries running on read replicas must still access metadata such as:
- Tables
- Indexes
- Statistics
In rare cases, the following scenario can occur:
- A query on the read replica acquires metadata locks.
- The primary replica executes schema changes.
- The redo process attempts to replay those changes on the secondary.
- The redo process becomes blocked by the query.
If the blocking query runs for too long, the system may terminate the query automatically to prevent the read replica from falling behind the primary.
When this occurs, the session may receive errors such as:
- Error 1219
- Error 3947
Root Cause in This Customer Scenario
In this particular case, the customer had an ETL pipeline running every 5 minutes on the primary replica.
The ETL process frequently modified database objects, which resulted in metadata changes that needed to be replayed on the read replica.
At the same time, reporting queries were running on the read replica. Occasionally, these queries conflicted with schema operations being replayed through the redo process.
This caused the redo process to lag behind, and eventually the system terminated the queries to maintain replication health, resulting in Error 3947.
Mitigation and Best Practices
While this behavior is by design, the following best practices can help minimize the impact.
Avoid Frequent Schema Changes During Peak Workloads
Frequent schema modifications increase the likelihood of blocking on read replicas.
Where possible:
- Schedule schema changes during low workload windows
- Reduce frequent metadata modifications in ETL jobs
Monitor Long Running Queries
Long-running queries increase the risk of blocking redo operations.
Use the following query to identify such queries:
SELECT
r.session_id,
r.start_time,
r.total_elapsed_time/1000 AS elapsed_seconds,
r.status,
r.wait_type,
r.blocking_session_id,
t.text AS query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.session_id <> @@SPID
ORDER BY r.total_elapsed_time DESC;
Implement Retry Logic in Applications
Because these errors can occur intermittently, applications should implement retry logic when encountering:
- Error 3947
- Error 1219
Retrying the transaction typically succeeds once the redo process catches up.
Conclusion
Read replicas significantly improve scalability by offloading reporting workloads, but they still depend on the redo process to stay synchronized with the primary replica.
If long-running queries interfere with redo operations, the system may terminate those queries to protect replication health and availability.
By monitoring blocking, schema locks, and long-running queries on read replicas, it becomes easier to identify and mitigate these scenarios.