Why PITR Restore for Azure SQL Hyperscale Can Take Longer Than Expected
Azure SQL Database Hyperscale is designed to deliver fast, storage‑optimized backup and restore operations. According to Microsoft documentation, most Point‑in‑Time Restore (PITR) operations for Hyperscale should complete within 10 minutes, regardless of database size, because the service uses metadata-based restore techniques rather than copying full data files. However, some real‑world scenarios can lead to unexpectedly long restore times, even when the source database is Hyperscale and even when no obvious configuration changes are made. This article explains one such scenario, outlines what happened, and provides guidance to avoid similar delays in the future. Expected Behavior for Hyperscale PITR Hyperscale databases use a unique architecture that separates compute and storage. Backups are taken from storage snapshots and do not require data copying during a typical PITR restore. From Microsoft Learn: “Most restores complete within minutes, even for large databases.” Ref: https://learn.microsoft.com/azure/azure-sql/database/hyperscale-automated-backups-overview?view=azuresql#backup-and-restore-performance This performance expectation applies as long as the Backup Storage Redundancy remains the same between the source DB and the target restore. Customer Scenario Overview A customer initiated PITR restore operations for Hyperscale DB: Source DB: Hyperscale (SLO: HS_PRMS_64) Target DB: Hyperscale (SLO: HS_PRMS_128) Same logical server Source DB Backup Storage Redundancy: Standard_RAGRS Customer enabled Zone Redundancy for the target database during restore The customer therefore expected the restore to finish within the normal Hyperscale window (~10 minutes). Instead, the restore took significantly longer. Why the Restore Took Longer Although the source database used Standard_RAGRS, enabling Zone Redundancy at restore time introduced a configuration change that affected the underlying Backup Storage Redundancy (BSR) for the newly restored database. 🔍 Key Point: Changing BSR Creates a Full "Size-of-Data" Restore When the target DB uses a different BSR type than the source DB, Azure SQL Database cannot perform a fast metadata-based restore. Instead, it must perform a full data copy, and the restore becomes proportional to the database size: More data → longer restore Effectively behaves like a physical data movement operation This overrides Hyperscale’s normally fast PITR workflow This behavior is documented here: https://learn.microsoft.com/azure/azure-sql/database/hyperscale-automated-backups-overview?view=azuresql#backup-and-restore-performance In the customer’s case: Customer-enabled Zone Redundancy changed the restore workflow. As a result, the system selected a backup storage redundancy configuration different than the source: Restore workflow chose: Standard_RAGZRS Source database actually used: Standard_RAGRS (non‑zone‑redundant) This mismatch triggered a size-of-data restore, leading to the observed delay. Summary of Root Cause ✔ Hyperscale PITR is fast only when BSR is unchanged ✔ Customer enabled Zone Redundant configuration during restore ✔ This resulted in a different Backup Storage Redundancy from the source ✔ Target restore had to perform a full data copy, not metadata-based restore ✔ Restore time scaled with database size → leading to long restore duration Key Takeaways 1. Do not change Backup Storage Redundancy during PITR unless necessary Any change (e.g., RAGRS → RAGZRS) converts the restore into a size‑of‑data operation. 2. 
Restores that involve cross‑region or cross‑redundancy conversions always take longer
This applies equally to: PITR restore, restore to another server, restore with SLO changes, and restore involving ZRS/RA‑GZRS transitions.
3. Hyperscale PITR is extremely fast—when configuration is unchanged
If the source and target BSR match, Hyperscale restores usually complete in minutes.
4. Enabling Zone Redundancy is valid, but do it after the restore
If the customer wants ZRS for the restored DB: perform PITR first (fast restore), then update redundancy post‑restore (online operation).
Conclusion
While Hyperscale PITR restores are typically very fast, configuration changes during the restore—especially related to Backup Storage Redundancy—can trigger a full data copy and significantly increase restore duration. To get the best performance: keep the same BSR when performing PITR, apply redundancy changes after the restore completes, and use metadata-based restores whenever possible. Understanding these nuances helps ensure predictable recovery times and aligns operational processes with Hyperscale's architectural design.
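If a Hyperscale PITR does run long, restore progress can be tracked from the master database of the target logical server. The sketch below is a minimal example using the documented sys.dm_operation_status view; the database name MyRestoredDb is a placeholder for your restore target.
-- Connect to the master database of the logical server hosting the restore target
SELECT major_resource_id   AS database_name,
       operation,
       state_desc,
       percent_complete,
       start_time,
       last_modify_time
FROM sys.dm_operation_status
WHERE major_resource_id = N'MyRestoredDb'   -- placeholder: name of the database being restored
ORDER BY start_time DESC;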
When and How to Update Statistics in Azure SQL Database
Accurate statistics are a cornerstone of good query performance in Azure SQL Database. While the platform automatically manages statistics in most scenarios, there are real‑world cases where manual intervention is not only recommended—but essential. This article explains when, why, and how to update statistics in Azure SQL Database, with practical samples and real customer scenarios drawn from production support cases.
Microsoft Learn reference (overview): https://learn.microsoft.com/sql/relational-databases/statistics/statistics
Why Statistics Matter
SQL Server's query optimizer relies on statistics to estimate row counts, choose join strategies, allocate memory grants, and decide whether to run operations in parallel. When statistics are stale or inaccurate, even well‑indexed queries can suddenly degrade. In Azure SQL Database:
AUTO_CREATE_STATISTICS is enabled and managed by the platform
AUTO_UPDATE_STATISTICS runs asynchronously
Because updates are async, queries may continue running with outdated cardinality estimates until statistics refresh completes. https://learn.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide
When Manual Statistics Updates Are Required
1. After Large Data Changes (ETL / Batch Operations)
Customer scenario: A nightly ETL job bulk‑inserts millions of rows into a fact table. The following morning, reports time out and logical reads spike.
Why it happens: Auto‑update thresholds are based on row‑count changes and may not trigger immediately—especially for append‑only or skewed data.
Recommended action:
UPDATE STATISTICS dbo.FactSales;
Target only the critical statistic if known:
UPDATE STATISTICS dbo.FactSales (IX_FactSales_CreatedDate);
2. Query Plan Regression Without Schema Changes
Customer scenario: A stable query suddenly switches from a Nested Loops join to a Hash Join, increasing CPU usage and BUFFERIO waits.
Root cause: Statistics no longer reflect current data distribution.
Recommended action:
UPDATE STATISTICS dbo.Customer WITH FULLSCAN;
Learn more: https://learn.microsoft.com/sql/relational-databases/statistics/update-statistics
3. After Restore Operations (PITR / Geo‑Restore / Database Copy)
Customer scenario: After a Point‑in‑Time Restore (PITR) on a Hyperscale database, queries run slower despite healthy platform telemetry.
Why it happens: Statistics are restored as‑is, but workload patterns often change after the restore point. Auto‑update statistics may lag behind.
Recommended action:
EXEC sp_updatestats;
Prioritize heavily accessed tables first on large databases. Learn more: https://learn.microsoft.com/azure/azure-sql/database/recovery-using-backups
Query Store Comparison: Before vs After Updating Statistics
One of the most effective ways to validate the impact of statistics updates is Query Store.
Before update (typical signs): sudden plan change for the same query text; increased logical reads and CPU time; change in join strategy or memory grant.
After statistics update: the optimizer selects a more efficient plan; logical reads are reduced; CPU and duration stabilize.
Example workflow:
-- Capture runtime stats
SELECT * FROM sys.query_store_runtime_stats WHERE query_id = <QueryID>;
-- Update statistics
UPDATE STATISTICS dbo.Orders;
-- Force recompilation
EXEC sp_recompile 'dbo.Orders';
Query Store reference: https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store
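To make the before/after comparison more concrete, the following sketch aggregates Query Store runtime statistics per plan for a single query, so the plan chosen after UPDATE STATISTICS can be compared with the previous one; the query_id value 42 is a placeholder taken from sys.query_store_query.
-- Compare average runtime metrics per plan for one query (query_id = 42 is a placeholder)
SELECT p.query_id,
       rs.plan_id,
       SUM(rs.count_executions)      AS executions,
       AVG(rs.avg_duration) / 1000.0 AS avg_duration_ms,
       AVG(rs.avg_cpu_time) / 1000.0 AS avg_cpu_ms,
       AVG(rs.avg_logical_io_reads)  AS avg_logical_reads,
       MAX(rs.last_execution_time)   AS last_seen
FROM sys.query_store_plan AS p
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
WHERE p.query_id = 42
GROUP BY p.query_id, rs.plan_id
ORDER BY last_seen DESC;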
Decision Flow: When Should I Update Statistics?
Performance regression observed?
→ Yes → Query plan changed without schema change?
→ Yes → Recent data change / restore / ETL?
→ Yes → Update targeted statistics
If NO at any step, rely on automatic statistics and continue monitoring.
What NOT to Do
❌ Do not run blanket WITH FULLSCAN on all tables. FULLSCAN is CPU and IO expensive, especially on large or Hyperscale databases.
❌ Do not schedule frequent database‑wide sp_updatestats jobs. This can introduce unnecessary workload and plan churn.
❌ Do not update statistics blindly without investigation. Always validate plan regression or stale estimates using Query Store or execution plans.
Checking Statistics Freshness
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
ORDER BY sp.last_updated;
DMV reference: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-stats-properties-transact-sql
Best Practices Summary
✅ Prefer targeted statistics updates
✅ Update stats after bulk data changes or restores
✅ Validate results using Query Store
✅ Avoid unnecessary FULLSCAN operations
✅ Use stats updates as a diagnostic and remediation step, not routine maintenance
Conclusion
Although Azure SQL Database manages statistics automatically, asynchronous updates and changing workload patterns can result in sub‑optimal query plans. Manually updating statistics after significant data changes, restore operations, or observed plan regressions is a safe and effective best practice to restore optimal query performance.
Geo‑Replication Redo Lag in Azure SQL Database
Monitoring and Troubleshooting Using Public DMVs Azure SQL Database provides built‑in high availability and geo‑replication capabilities to ensure database resilience and business continuity. While replication is fully managed by the platform, customers may occasionally observe a delay between the primary and geo‑replicated secondary database, especially during periods of heavy write activity. This article provides a public, supported approach to monitoring and understanding geo‑replication delay caused by redo lag, using official Dynamic Management Views (DMVs) and T‑SQL only, without relying on internal tools. Scenario Overview Customers may report that changes committed on the primary database are not immediately visible on the geo‑replicated secondary, sometimes for several minutes. Typical symptoms include: Reporting queries on the geo‑secondary showing stale data Increased redo catch‑up time on the secondary No performance degradation observed on the primary Replication eventually catches up without manual intervention This behavior is commonly associated with redo lag, where the secondary has already received the log records but requires additional time to replay them into data pages. Understanding Geo‑Replication Redo Lag In Azure SQL Database geo‑replication: Transactions are sent from the primary to the secondary Log records are hardened on the secondary The secondary applies these records asynchronously to its data files (redo) If the secondary experiences temporary pressure (for example, due to schema changes or intensive operations), redo may fall behind, causing the secondary to lag—even though data durability remains intact. Step 1: Check Geo‑Replication Status and Lag The primary DMV for monitoring geo‑replication is: 📘 sys.dm_geo_replication_link_status Public documentation: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database Sample T‑SQL Query SELECT database_id, partner_server, partner_database, replication_state_desc, replication_lag_sec, last_replication_time FROM sys.dm_geo_replication_link_status; Key Columns Explained Column Description replication_state_desc Current replication state replication_lag_sec Estimated lag (in seconds) last_replication_time Last successful replication timestamp (UTC) partner_server Geo‑replica logical server Interpretation Healthy replication: replication_lag_sec = 0 and state is healthy Transient delay: Lag increases temporarily but later returns to zero Sustained delay: Lag remains elevated for an extended period and should be investigated further Step 2: Monitor Local Replica Redo Health To understand whether lag is related to redo activity on the secondary, you can query: 📘 sys.dm_database_replica_states Public documentation: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database Sample T‑SQL Query SELECT database_id, role_desc, synchronization_state_desc, synchronization_health_desc, redo_queue_size, redo_rate, last_commit_time from sys.dm_database_replica_states Key Insights redo_queue_size Indicates how much log data is pending replay on the secondary redo_rate Shows how quickly redo is being applied last_commit_time Helps estimate data freshness on the secondary (UTC) Interpretation Scenario Observation Normal operation redo_queue_size = 0 Write burst redo_queue_size increases temporarily Recovery in progress redo_rate remains positive Healthy state 
synchronization_health_desc = HEALTHY
Short‑lived spikes are expected platform behavior and usually resolve automatically.
Practical Monitoring Considerations
Reporting Workloads
If applications read from the geo‑secondary for reporting: expect near‑real‑time, not guaranteed real‑time visibility; design reports to tolerate small delays; route latency‑sensitive reads to the primary if required.
Transaction Patterns
Redo lag is more likely during: large batch updates, index maintenance operations, schema modification commands, and bursty write workloads. Keeping transactions short and efficient reduces replication pressure.
Best Practices
Use UTC timestamps consistently when correlating events
Monitor replication_lag_sec and redo_queue_size together
Implement retry logic in applications for transient conditions
Avoid assuming read replicas are always perfectly synchronized
Do not take manual actions during short‑lived redo spikes unless the lag persists
Summary
Redo lag in Azure SQL Database geo‑replication is a normal and self‑healing behavior during certain workload patterns. By using supported public DMVs and T‑SQL, customers can: monitor replication health safely, understand replica freshness, make informed application routing decisions, and avoid unnecessary intervention. Azure SQL Database automatically stabilizes replication once redo pressure subsides, ensuring durability and availability without manual management.
References
Azure SQL Database High Availability: https://learn.microsoft.com/azure/azure-sql/database/high-availability-overview
sys.dm_geo_replication_link_status: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database
sys.dm_database_replica_states: https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database
Azure SQL Database High Availability: Architecture, Design, and Built‑in Resilience
High availability (HA) is a core pillar of Azure SQL Database. Unlike traditional SQL Server deployments—where availability architectures must be designed, implemented, monitored, and maintained manually—Azure SQL Database delivers built‑in high availability by design. By abstracting infrastructure complexity while still offering enterprise‑grade resilience, Azure SQL Database enables customers to achieve strict availability SLAs with minimal operational overhead. In this article, we’ll cover: Azure SQL Database high‑availability design principles How HA is implemented across service tiers: General Purpose Business Critical Hyperscale Failover behavior and recovery mechanisms Architecture illustrations explaining how availability is achieved Supporting Microsoft Learn and documentation references What High Availability Means in Azure SQL Database High availability in Azure SQL Database ensures that: Databases remain accessible during infrastructure failures Hardware, software, and network faults are handled automatically Failover occurs without customer intervention Data durability is maintained using replication, quorum, and consensus models This is possible through the separation of: Compute Storage Control plane orchestration Azure SQL Database continuously monitors health signals across these layers and automatically initiates recovery or failover when required. Azure SQL Database High Availability – Shared Concepts Regardless of service tier, Azure SQL Database relies on common high‑availability principles: Redundant replicas Synchronous and asynchronous replication Automatic failover orchestration Built‑in quorum and consensus logic Transparent reconnect via the Azure SQL Gateway Applications connect through the Azure SQL Gateway, which automatically routes traffic to the current primary replica—shielding clients from underlying failover events. High Availability Architecture – General Purpose Tier The General-Purpose tier uses a compute–storage separation model, relying on Azure Premium Storage for data durability. Key Characteristics Single compute replica Storage replicated three times using Azure Storage Read‑Access Geo‑Redundant Storage (RA‑GRS) optional Stateless compute that can be restarted or moved Fast recovery using storage reattachment Architecture Diagram – General Purpose Tier Description: Clients connect via the Azure SQL Gateway, which routes traffic to the primary compute node. The compute layer is stateless, while Azure Premium Storage provides triple‑replicated durable storage. Failover Behavior Compute failure triggers creation of a new compute node Database files are reattached from storage Typical recovery time: seconds to minutes 📚 Reference: https://learn.microsoft.com/azure/azure-sql/database/service-tier-general-purpose High Availability Architecture – Business Critical Tier The Business-Critical tier is designed for mission‑critical workloads requiring low latency and fast failover. Key Characteristics Multiple replicas (1 primary + up to 3 secondaries) Always On availability group–like architecture Local SSD storage on each replica Synchronous replication Automatic failover within seconds Architecture Diagram – Business Critical Tier Description: The primary replica synchronously replicates data to secondary replicas. Read‑only replicas can offload read workloads. Azure SQL Gateway transparently routes traffic to the active primary replica. 
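Because the Business Critical architecture described above exposes read‑only replicas, it can be useful to confirm where a session actually landed. A simple check, assuming the connection was opened with ApplicationIntent=ReadOnly, is to inspect the database updateability property:
-- Returns READ_ONLY when the session is served by a readable secondary,
-- READ_WRITE when it is served by the primary replica
SELECT DB_NAME()                                      AS database_name,
       DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;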
Failover Behavior
If the primary replica fails, a secondary is promoted automatically
No storage reattachment is required
Client connections are redirected automatically
Typical failover time: seconds
📚 Reference: https://learn.microsoft.com/azure/azure-sql/database/service-tier-business-critical
High Availability Architecture – Hyperscale Tier
The Hyperscale tier introduces a distributed storage and compute architecture, optimized for very large databases and rapid scaling scenarios.
Key Characteristics
Decoupled compute and page servers
Multiple read replicas
Fast scale‑out and fast recovery
Durable log service ensures transaction integrity
Architecture Diagram – Hyperscale Tier
Description: The compute layer processes queries, while durable log services and distributed page servers manage data storage independently, enabling rapid failover and scaling.
Failover Behavior
Compute failure results in rapid creation of a new compute replica
Page servers remain intact
Log service ensures zero data loss
📚 Reference: https://learn.microsoft.com/azure/azure-sql/database/service-tier-hyperscale
How Azure SQL Database Handles Failures
Azure SQL Database continuously monitors critical health signals, including: heartbeats, IO latency, replica health, and storage availability.
Automatic Recovery Actions
Restarting failed processes
Promoting secondary replicas
Recreating compute nodes
Redirecting client connections
Applications should implement retry logic and transient‑fault handling to fully benefit from these mechanisms.
📚 Reference: https://learn.microsoft.com/azure/architecture/best-practices/transient-faults
Zone Redundancy and High Availability
Azure SQL Database can be configured with zone redundancy, distributing replicas across Availability Zones in the same region.
Benefits
Protection against datacenter‑level failures
Increased SLA
Transparent resilience without application changes
📚 Reference: https://learn.microsoft.com/azure/azure-sql/database/high-availability-sla
Summary
Azure SQL Database delivers high availability by default, removing the traditional operational burden associated with SQL Server HA designs.
Service Tier | HA Model | Typical Failover
General Purpose | Storage‑based durability | Minutes
Business Critical | Multi‑replica, synchronous | Seconds
Hyperscale | Distributed compute & storage | Seconds
By selecting the appropriate service tier and enabling zone redundancy where required, customers can meet even the most demanding availability and resilience requirements with minimal complexity.
Additional References
Azure SQL Database HA overview: https://learn.microsoft.com/azure/azure-sql/database/high-availability-overview
Azure SQL Database SLAs: https://azure.microsoft.com/support/legal/sla/azure-sql-database
Application resiliency guidance: https://learn.microsoft.com/azure/architecture/framework/resiliency
Missing MSchange_tracking_history Table in SQL Server: Causes, Cleanup Behavior
Change Tracking (CT) in SQL Server is a lightweight mechanism for tracking data changes without the overhead of full auditing or CDC. However, when troubleshooting CT cleanup behavior, one issue that often surprises engineers is the absence of the MSchange_tracking_history table—a system table expected to exist when auto-cleanup is running normally. In this blog, we walk through the scenarios that can lead to a missing history table, what to validate in your environment, and how to collect the right data for deeper investigation. Confirming Whether Auto-Cleanup Has Run Successfully When diagnosing Change Tracking cleanup issues, the first step is verifying whether the auto‑cleanup process has executed at least once on the affected database. Here are key questions to ask: ✔ Was the auto‑cleanup process ever triggered? SQL Server's CT cleanup runs automatically in the background based on internal scheduling. You should validate whether it has executed at least once since CT was enabled (or re‑enabled). ✔ Were there rows outside the retention period at the time cleanup ran? The cleanup task only deletes rows older than the configured retention period. If no such rows existed, cleanup may have run—but removed nothing. ✔ Did the cleanup task complete successfully? It’s important to know whether cleanup: Started, Scanned eligible rows, Deleted entries, and Completed without interruption. If cleanup is running but not making progress, SQL Server provides internal diagnostics that can help determine whether the process is blocked, paused, or skipped. Why the MSchange_tracking_history Table May Be Missing Microsoft’s Product Team has clarified an important behavioral detail: ➡ The MSchange_tracking_history table is created only by the auto-cleanup process. This means the table may be missing if: 🔸 CT was disabled previously If Change Tracking was turned off at any point, SQL Server removes internal structures associated with CT. 🔸 CT was later re-enabled, but auto-cleanup never ran Re‑enabling CT does not immediately recreate the history table. It will only be created once auto-cleanup executes naturally for the first time. 🔸 Only manual cleanup was performed Manual cleanup (using internal stored procedures or scripts) does not generate the MSchange_tracking_history table. Only the automated engine-driven cleanup does. ➡ Resolution: Ensuring that the auto-cleanup process runs at least once is typically enough to recreate the missing history table. How to Collect Deeper Diagnostics: SQL Server Profiler Trace For deeper analysis—especially when the cleanup process appears stalled, not deleting rows, or not creating the history table—you should collect a SQL Server Profiler trace that captures a full auto-cleanup cycle. What the trace helps with: Verifying which CT internal procedures are being executed. Determining whether the cleanup scans and deletion operations occur. Identifying blocking, timeouts, or internal errors. Confirming if the engine attempts to initialize or update the history table. When to collect the trace: Run it during an auto-cleanup cycle. Ensure that the cycle actually deletes records, not just scans without action. This allows the Product Team or DBA staff to analyze real cleanup behavior from the engine side. T‑SQL: Check Change Tracking Cleanup Status These queries help you verify whether auto‑cleanup has been running, whether it’s enabled, and whether there are expired records (outside your retention window) still present in CT side tables or syscommittab. 
The approach aligns with Microsoft's troubleshooting guidance.
1) Is auto‑cleanup enabled? What's the retention?
-- Run in the target database
SELECT db.database_id,
       DB_NAME(db.database_id) AS database_name,
       db.is_auto_cleanup_on,      -- 1 = ON, 0 = OFF
       db.retention_period,
       db.retention_period_units   -- 1 = MINUTES, 2 = HOURS, 3 = DAYS
FROM sys.change_tracking_databases AS db
WHERE db.database_id = DB_ID();
This view reports whether auto‑cleanup is ON and shows the retention.
2) Has auto‑cleanup run recently? (History table)
-- History is created only when auto‑cleanup runs
SELECT TOP (1000)
       start_time,
       end_time,
       cleanup_version,   -- engine cleanup version recorded
       comments
FROM dbo.MSchange_tracking_history
ORDER BY start_time DESC;
If dbo.MSchange_tracking_history doesn't exist or is empty, auto‑cleanup most likely has never run (or hasn't run since CT was re‑enabled). The table itself is created by the auto‑cleanup process.
Note: The missing MSchange_tracking_history table can occur if CT was disabled and re‑enabled but auto‑cleanup hasn't run since; manual cleanup alone won't create it. A single successful auto‑cleanup cycle will create the table.
3) Are there rows outside the retention window? (Per‑table scan)
The following script checks each CT‑enabled table and estimates whether its side table contains rows older than the current min valid version (i.e., candidates for cleanup):
/*
  This script enumerates CT-enabled tables, derives the min_valid_version,
  and checks whether rows exist with SYS_CHANGE_VERSION < min_valid_version.
  It uses dynamic SQL because each CT side table is an internal table named:
      change_tracking_<object_id>
  See Microsoft guidance on CT side tables. Direct access to internal tables may not be
  permitted in every environment; the TRY/CATCH below reports any table that cannot be
  queried instead of aborting the whole batch.
*/
SET NOCOUNT ON;

DECLARE @sql  NVARCHAR(MAX) = N'';
DECLARE @CRLF NCHAR(2)      = NCHAR(13) + NCHAR(10);

;WITH ct AS
(
    SELECT t.object_id,
           QUOTENAME(OBJECT_SCHEMA_NAME(t.object_id)) + N'.' +
           QUOTENAME(OBJECT_NAME(t.object_id)) AS table_name,
           t.min_valid_version
    FROM sys.change_tracking_tables AS t
)
SELECT @sql = STRING_AGG(CONVERT(NVARCHAR(MAX), N'
BEGIN TRY
    SELECT ''' + table_name + N''' AS ct_table,
           ''change_tracking_' + CAST(object_id AS NVARCHAR(20)) + N''' AS side_table,
           ' + CAST(min_valid_version AS NVARCHAR(20)) + N' AS min_valid_version,
           COUNT(*) AS expired_rows
    FROM sys.' + QUOTENAME(N'change_tracking_' + CAST(object_id AS NVARCHAR(20))) + N'
    WHERE SYS_CHANGE_VERSION < ' + CAST(min_valid_version AS NVARCHAR(20)) + N';
END TRY
BEGIN CATCH
    SELECT ''' + table_name + N''' AS ct_table,
           ERROR_NUMBER()  AS error_number,
           ERROR_MESSAGE() AS error_message;
END CATCH'), @CRLF)
FROM ct;

-- Execute the generated batch
EXEC sys.sp_executesql @sql;
How it works: CT stores changes in internal "side tables" (named change_tracking_<object_id>), and the minimum valid version per CT‑enabled table is tracked in sys.change_tracking_tables. Rows with SYS_CHANGE_VERSION below min_valid_version are eligible for cleanup and should be deleted by auto‑cleanup; the script surfaces a count of such rows.
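Before digging into the commit table, a lightweight way to see whether cleanup is advancing at all is to compare each table's minimum valid version with the database's current Change Tracking version; if auto‑cleanup runs, min_valid_version should keep moving forward. This is only a rough indicator, using the documented CHANGE_TRACKING_* functions:
-- How far is the current CT version ahead of each table's minimum valid version?
SELECT OBJECT_SCHEMA_NAME(t.object_id)                    AS schema_name,
       OBJECT_NAME(t.object_id)                           AS table_name,
       CHANGE_TRACKING_MIN_VALID_VERSION(t.object_id)     AS min_valid_version,
       CHANGE_TRACKING_CURRENT_VERSION()                  AS current_version,
       CHANGE_TRACKING_CURRENT_VERSION()
         - CHANGE_TRACKING_MIN_VALID_VERSION(t.object_id) AS version_span
FROM sys.change_tracking_tables AS t
ORDER BY version_span DESC;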
4) Check the commit table for expired records (syscommittab)
Auto‑cleanup also clears entries in the commit table (syscommittab) once they fall outside retention. This quick check compares the latest commit version with your tables' min_valid_version to see if commit entries are lagging:
-- Approximate commit backlog vs. min_valid_version
SELECT DB_NAME()           AS database_name,
       MAX(ctv.commit_ts)  AS latest_commit_ts,
       MAX(ctv.commit_lsn) AS latest_commit_lsn
FROM sys.syscommittab AS ctv;   -- internal commit table used by CT
If you see a large backlog in syscommittab relative to table min_valid_version, it can indicate cleanup isn't keeping up or hasn't run. (Microsoft's troubleshooting doc lists syscommittab growth as a symptom when cleanup is stalled.)
5) Optional: Force a manual cleanup (diagnostics only)
Use manual cleanup with caution and preferably when AUTO_CLEANUP = OFF, to avoid contention with the background task.
-- Flush commit table in batches (example batch size = 100,000)
EXEC sys.sp_flush_commit_table_on_demand 100000;
-- Flush a specific CT internal table (replace with your user table name)
EXEC sys.sp_flush_CT_internal_table_on_demand N'dbo.YourTable';
Manual cleanup procedures are documented in community guidance and Microsoft KBs; prefer auto‑cleanup for normal operation.
Relevant Behaviors & Known Issues
History table creation: dbo.MSchange_tracking_history is created only by the auto‑cleanup process. If CT was disabled and later re‑enabled but only manual cleanup has run since, the history table won't exist until one successful auto‑cleanup cycle completes. [learn.microsoft.com]
"Auto‑cleanup runs but isn't making progress": Follow Microsoft's troubleshooting flow; symptoms include many rows outside retention in side tables or syscommittab. [learn.microsoft.com]
Fixes and trace flags: Certain builds of SQL Server 2016/2017/2019 had issues where cleanup versions could go negative or overflow INT in the history table. Microsoft provides KBs and guidance (e.g., TF 8290) to reset invalid cleanup versions, and CU updates to fix history table datatype problems. If you suspect these issues, review the KBs and apply the relevant CUs.
Summary
A missing MSchange_tracking_history table often indicates that Change Tracking was disabled at some point and auto-cleanup has not run since re-enabling it. Auto-cleanup is the only mechanism that creates this table—manual cleanup will not. To resolve or further investigate:
Confirm that auto-cleanup has executed at least once.
Validate whether cleanup deleted rows beyond the retention period.
Capture a SQL Server Profiler trace for a successful cleanup cycle.
This ensures complete visibility into the CT engine behavior and helps determine whether remediation is needed.
Official Reference
Microsoft Learn — Troubleshoot change tracking auto‑cleanup issues (status, history table, retention, syscommittab symptoms): learn.microsoft.com/sql/.../cleanup-and-troubleshoot-change-tracking-sql-server
Microsoft Support — KB4538365 (negative cleanup versions; TF 8290 guidance): support.microsoft.com/.../kb4538365
Microsoft Support — KB5007039 (INT overflow in MSchange_tracking_history cleanup_version; CU fix): support.microsoft.com/.../kb5007039
Community references explaining CT side tables and cleanup mechanics: mssqltips.com — Change Tracking performance troubleshooting; Stack Overflow — manual cleanup examples & cautions
Azure SQL Database Data Sync Retirement: Migration Scenarios and Recommended Alternatives
Azure SQL Database Data Sync has long been used to keep data consistent across multiple Azure SQL databases. However, as the service moves toward retirement, many customers are now asking an important question: How do we replace Azure SQL Data Sync with a supported, future‑proof solution—without significant data loss or downtime? In this article, we’ll walk through: What Data Sync retirement means in practice Typical migration challenges A real-world customer scenario Recommended alternatives A step‑by‑step migration approach from DEV to PROD Useful references and documentation Why Azure SQL Data Sync Retirement Matters Azure SQL Data Sync relies on: Triggers Metadata tables Hub-and-spoke topology While functional, this architecture introduces complexity, performance overhead, and operational risks, especially as data volumes and workloads grow. Microsoft’s long-term direction favors scalable, resilient, and observable data integration services, such as Azure Data Factory (ADF) and event-driven replication patterns. If you are currently using Data Sync, planning a migration early is strongly recommended. Official guidance: https://learn.microsoft.com/azure/azure-sql/database/sql-data-sync-data-sql-server-sql-database Sample Customer Scenario Let’s consider a real scenario commonly seen in the field: 4 Azure SQL Databases Subscription: Contoso-DEV Current topology: Azure SQL Data Sync Target state: Consolidate all data into one Azure SQL Database Environment flow: DEV → UAT → PROD Database tiers: Standard (S0 / S1) Size: Below 250 GB per database Key requirements: Minimal data loss Quick replication Azure-native and supported replacement Clear operational model Migration Design Considerations Before selecting a tool, several factors must be evaluated: ✅ Latency tolerance (near real-time vs scheduled sync) ✅ Write patterns (conflicts, bidirectional vs unidirectional) ✅ Schema compatibility ✅ Operational overhead ✅ Long-term supportability For most consolidation scenarios, unidirectional replication (many → one) provides the best balance of simplicity and reliability. Diagram 1: Current State – Azure SQL Data Sync (Before Retirement) This diagram represents the existing topology, where multiple databases are synchronized using Azure SQL Data Sync into a single consolidated database. Characteristics Trigger‑based synchronization Additional metadata tables Limited observability Service approaching retirement Diagram 2: Target State – Azure Data Factory Based Consolidation This diagram shows the recommended replacement architecture using Azure Data Factory. Advantages No triggers or sync metadata tables Parallel ingestion Built‑in retry, monitoring, and alerting Fully supported and future‑proof Diagram 3: Incremental Replication Logic (ADF) This diagram explains how minimal data loss is achieved using incremental replication. Key Points No continuous connection required Typical RPO: 1–5 minutes Safe restart after failures Diagram 4: DEV → PROD Migration Flow This diagram highlights the recommended rollout approach starting with POC in DEV. Best Practices Build once, reuse across environments Parameterize connection strings Enable monitoring before PROD cutover Recommended Alternatives to Azure SQL Data Sync ✅ Option 1: Azure Data Factory (ADF) – Primary Recommendation Azure Data Factory provides a fully supported and scalable replacement for Data Sync when consolidating databases. 
Architecture Overview
One pipeline per source database
Initial full load
Incremental replication using: Change Tracking, or CDC (if applicable), or watermark columns (ModifiedDate / identity)
Why ADF?
Microsoft's strategic data integration platform
Built-in monitoring and retry logic
Parallel ingestion
Schema mapping and transformation support
📌 Best fit when: you need consolidation, near real‑time (minutes) is acceptable, and you want a future‑proof design.
📘 References:
https://learn.microsoft.com/azure/data-factory/copy-activity-overview
https://learn.microsoft.com/azure/data-factory/incremental-copy-overview
https://learn.microsoft.com/azure/data-factory/connector-azure-sql-database
⚠️ Option 2: SQL Transactional Replication (Limited Use)
Transactional replication can still work in narrow scenarios, but it adds operational complexity, offers limited flexibility for schema changes, and is not recommended for new designs.
📘 Reference: https://learn.microsoft.com/azure/azure-sql/database/replication-to-sql-database
🧭 Option 3: Azure SQL Managed Instance Link (Future‑Facing)
If your long-term roadmap includes Azure SQL Managed Instance, the MI Link feature enables near real-time replication. However, it is not applicable if your target remains Azure SQL Database and requires an infrastructure change.
📘 Reference: https://learn.microsoft.com/azure/azure-sql/managed-instance/link-feature
Recommended Migration Approach (DEV → PROD)
Phase 1 – Assessment
Review schema overlaps and key conflicts
Identify identity and primary key strategies
Confirm availability of Change Tracking and ModifiedDate / watermark columns
📘 Change Tracking: https://learn.microsoft.com/sql/relational-databases/track-changes/about-change-tracking-sql-server
Phase 2 – Initial Seeding (DEV)
Use ADF Copy Activity for full loads
Ingest each source DB into dedicated schemas or logical partitions
Validate row counts, referential integrity, and performance impact
Phase 3 – Incremental Replication
Enable incremental pipelines
Recommended frequency: every 1–5 minutes
Use parallelism for scalability
Simulate Data Sync behavior without triggers
Phase 4 – Cutover
Optional short write freeze
Final delta sync
Application validation
Promote pipelines to PROD
Data Loss and Performance Expectations
Metric | Expected Outcome
RPO | Minutes (configurable)
Downtime | Near‑zero
Performance impact | Predictable and controllable
Observability | Built‑in via ADF monitoring
Final Recommendation Summary
✅ Azure Data Factory with initial full load + incremental replication
✅ Azure-native, strategic, and supported
✅ Ideal for Data Sync retirement scenarios
✅ Scales from DEV to PROD with minimal redesign
Azure SQL Data Sync retirement is an opportunity—not a setback. With services like Azure Data Factory, customers can move toward better observability, cleaner architectures, easier production operations, and long-term platform alignment. If you are still relying on Azure SQL Data Sync, now is the right time to assess, plan, and migrate.
Helpful Resources
Azure SQL Data Sync overview: https://learn.microsoft.com/azure/azure-sql/database/sql-data-sync-data-sql-server-sql-database
Azure Data Factory incremental copy: https://learn.microsoft.com/azure/data-factory/incremental-copy-overview
Azure SQL change tracking: https://learn.microsoft.com/sql/relational-databases/track-changes/about-change-tracking-sql-server
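To make the Change Tracking option under "Recommended Alternatives" more concrete, here is a minimal sketch of the per‑table incremental extract an ADF pipeline (or any scheduler) could issue against a source database. The table dbo.Orders, its OrderID key, and the stored @LastSyncVersion value are assumptions for illustration only.
-- Incremental extract based on Change Tracking (hypothetical dbo.Orders table)
DECLARE @LastSyncVersion BIGINT = 0;   -- placeholder: version persisted after the previous run

SELECT ct.SYS_CHANGE_VERSION,
       ct.SYS_CHANGE_OPERATION,        -- I = insert, U = update, D = delete
       o.*                             -- current row values (NULL for deleted rows)
FROM CHANGETABLE(CHANGES dbo.Orders, @LastSyncVersion) AS ct
LEFT JOIN dbo.Orders AS o
       ON o.OrderID = ct.OrderID       -- hypothetical primary key
ORDER BY ct.SYS_CHANGE_VERSION;

-- Capture the version to store as the watermark for the next incremental run
SELECT CHANGE_TRACKING_CURRENT_VERSION() AS next_sync_version;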
Lessons Learned #535: BACPAC Import Failures in Azure SQL Database Due to Incompatible Users
We recently worked on a support case where a customer was trying to import a BACPAC file, generated on a different server and subscription, into their Azure SQL Database. The process kept failing with the following error:
"Could not import package. Error SQL72014: Framework Microsoft SqlClient Data Provider: Msg 33159 - Only connections established with Active Directory accounts can create other Active Directory user"
At first glance, this looked like a permissions issue, but digging deeper we realized that the error was triggered when the import process tried to create Entra ID (Azure AD) users while the connection was being made with a SQL login. We checked several things in the BACPAC:
The BACPAC contained references to external Active Directory users that were valid in the source environment but not in the target.
Both the Azure portal and SQL Server Management Studio (SSMS) failed with the same error.
Since BACPAC files include both schema and user objects, incompatible users were being carried over and breaking the import.
After thorough investigation, the following resolution path was established:
We created a dummy copy of the source database.
We removed the external AD/Entra users from that copy.
We generated a new BACPAC from this cleaned database.
We imported it into the target Azure SQL Database — and this time it worked.
We explained several details:
BACPAC files include both schema and security objects, including users.
External Active Directory users that are not present in the target environment can cause import failures.
Before exporting, review and remove or adjust user objects to avoid this issue — particularly when migrating across subscriptions, servers, or organizations with different Azure AD tenants.
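Before exporting, the Entra ID (external) principals contained in the source database can be listed with a simple query; reviewing this list on the dummy copy makes it easier to decide which users to drop. A minimal sketch:
-- Database principals mapped to Entra ID identities (external users and groups)
SELECT name,
       type_desc,                   -- EXTERNAL_USER / EXTERNAL_GROUPS
       authentication_type_desc,
       create_date
FROM sys.database_principals
WHERE type IN ('E', 'X')            -- E = external user, X = external group
ORDER BY name;

-- Example cleanup on the dummy copy (replace with the actual principal name):
-- DROP USER [user@contoso.com];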
Lesson Learned #531: Scalar UDF vs Parallelism
Last week I worked on a support case where our customer reported that the exact same query, executed against two identical databases with the same resources, was taking significantly longer on one of them. Both databases had the same number of rows, up-to-date statistics, and identical indexes. We started by collecting the execution plans, and I'd like to share what we found.
Comparing both execution plans, in the XML of the execution plan that takes longer we found the following line:
<QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="TSQLUserDefinedFunctionsNotParallelizable">
However, in the XML of the execution plan that takes less time we found:
<QueryPlan DegreeOfParallelism="1" ContainsInlineScalarTsqlUdfs="true">
So, based on this difference, it is clear that the query uses a scalar UDF in both databases, but in one of them the definition of the scalar UDF makes it impossible to run the query in parallel, while in the other database the query can still run in parallel even though it uses a scalar UDF. As both databases are using the same compatibility level of 160, we started to analyze what is different between them that leads to this behavior. Sharing with you an example:
DROP TABLE IF EXISTS dbo.TestData;
GO
CREATE TABLE dbo.TestData
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    Value1 INT,
    Value2 INT
);
INSERT INTO dbo.TestData (Value1, Value2)
SELECT ABS(CHECKSUM(NEWID()) % 10000),
       ABS(CHECKSUM(NEWID()) % 10000)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
WHERE a.object_id < 150 AND b.object_id < 150;
Let's create the scalar function that blocks parallel execution:
CREATE OR ALTER FUNCTION dbo.fn_BlockParallel (@v1 INT)
RETURNS INT
AS
BEGIN
    DECLARE @x INT;
    SELECT @x = DATEDIFF(MILLISECOND, GETDATE(), SYSDATETIME());
    RETURN ISNULL(@x, 0);
END;
When I executed the following query, I saw this in the plan XML: <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="TSQLUserDefinedFunctionsNotParallelizable" CachedPlanSize="16" CompileTime="1" CompileCPU="1" CompileMemory="216">
SELECT ID, dbo.fn_BlockParallel(Value1) FROM dbo.TestData WHERE Value1 > 100 OPTION (MAXDOP 4);
GO
If I modify the code to a new scalar UDF, I see: <QueryPlan DegreeOfParallelism="1" CachedPlanSize="16" CompileTime="1" CompileCPU="1" CompileMemory="272" ContainsInlineScalarTsqlUdfs="true">
CREATE OR ALTER FUNCTION dbo.fn_BlockParallel (@v1 INT)
RETURNS INT
AS
BEGIN
    DECLARE @x INT;
    SELECT @x = @v1 * 2;
    RETURN @x;
END;
So, even when using compatibility level 160, certain constructs inside scalar UDFs can prevent inlining, which in turn blocks query parallelism. When performance varies between environments, one of the things to check is whether scalar UDFs are involved, and if they are eligible for inlining. To detect the issue quickly, look at the execution plan XML and check the attributes DegreeOfParallelism, ContainsInlineScalarTsqlUdfs, and NonParallelPlanReason.
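On SQL Server 2019+ and Azure SQL Database, you can also check from the catalog whether a scalar UDF is considered inlineable, instead of reading the plan XML. A small sketch using sys.sql_modules:
-- Which scalar UDFs are eligible for scalar UDF inlining?
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id)        AS function_name,
       m.is_inlineable                 -- 1 = eligible for inlining, 0 = not eligible
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE o.type = 'FN';   -- scalar functions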
Lesson Learned #533: Intermittent Azure SQL Database Connectivity and Authentication Issues
While working on a recent service request, we helped a customer troubleshoot intermittent connection and authentication failures when accessing Azure SQL Database using Active Directory (Entra ID) authentication from a Java-based application using HikariCP with JDBC/ODBC. They got the following errors:
com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: Failed to authenticate.. Request was throttled according to instructions from STS. Retry in 29701 ms.
java.sql.SQLTransientConnectionException: HikariPool-application1 - Connection is not available, request timed out after
The first insight came from focusing on the error message: Request was throttled according to instructions from STS. Retry in 29701 ms.
This message is returned by the Azure Active Directory Security Token Service (STS) when the client is sending too many token requests in a short period of time, exceeding the allowed threshold. We don't have all the internal details, but in high-concurrency environments (e.g., multiple threads, a large connection pool) each thread may independently request a new token, and the service limit can be reached; if the connection pool also retries frequently or fails authentication, the number of token requests can spike further. This is why, when HikariCP tries to initialize or refresh connections quickly, many threads attempt to connect at once, all trigger token requests simultaneously, and STS throttling is reached.
To avoid this situation there are several options, such as ensuring the application caches tokens and reuses them across threads, using Managed Identity, increasing the retry-after delay, or, depending on the HikariCP configuration, pre-warming connections gradually. Of course, discussing it with your Entra ID administrators is another option.
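From the database side, one supported way to gauge how much connection churn a pool is generating is to count the sessions each application currently holds; this does not show token requests, but a rapidly changing count for the pool's program name is a hint that connections are being rebuilt too aggressively. A minimal sketch:
-- Current user sessions grouped by application and login
SELECT program_name,
       login_name,
       COUNT(*) AS session_count
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY program_name, login_name
ORDER BY session_count DESC;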