Forum Discussion
Data Archival in Merge replication
We are working on purging data in a 3.5 TB database that acts as the publisher in Merge replication, with two subscribers hosted on EC2 machines in the Europe and US regions respectively. We are looking for a data archival strategy that keeps the sync between publisher and subscribers as efficient as possible, because the volume of records to be purged at the publisher will be in the millions.
Such a large purge at the publisher will choke the subscribers and could delay the data sync by days or even weeks, so we are looking for recommendations on an archival strategy.
- Mks_1973 (Iron Contributor)
First, Assess the Data to Purge:
Identify the tables and records to archive based on business rules, such as dates or other criteria. Ensure you’re only purging data that doesn’t need to be replicated.
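For example, a quick sizing query like the one below helps confirm how many rows fall under the archival rule before you touch anything. (dbo.Orders and OrderDate are purely placeholder names for your own table and date column.)

SELECT COUNT(*)       AS RowsToArchive,
       MIN(OrderDate) AS OldestRow,
       MAX(OrderDate) AS NewestRowInScope
FROM dbo.Orders
WHERE OrderDate < '2020-01-01';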
Instead of deleting millions of records at once, batch the purge process (a batched delete pattern is shown in the T-SQL sample below). This minimizes the load on replication and reduces potential lock contention.
Create appropriate indexes on the columns used to identify the data to purge, to optimize query performance.
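A minimal sketch, again using the placeholder dbo.Orders/OrderDate names; the ONLINE option needs Enterprise Edition (or Azure SQL), so drop it if that does not apply to you:

CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate)
WITH (ONLINE = ON); -- supports the purge predicate without long blocking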
If possible, consider moving archived data to tables that are outside of the replication scope, especially if the data no longer needs to be synchronized across servers.
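Something along these lines, assuming a non-replicated archive database called ArchiveDB with a matching table structure (all names are placeholders); copy first, verify the counts, then purge from the published table:

INSERT INTO ArchiveDB.dbo.Orders_Archive (OrderID, OrderDate, CustomerID, Amount)
SELECT OrderID, OrderDate, CustomerID, Amount
FROM dbo.Orders
WHERE OrderDate < '2020-01-01';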
If the tables are partitioned (for example, by date), you can switch entire partitions out to a staging table and then move that data to a separate archival database. This lets you archive or purge large blocks of data without deleting individual rows, which is far more efficient.
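A rough sketch of the switch itself, assuming dbo.Orders is partitioned by date and dbo.Orders_Staging has an identical structure on the same filegroup (names are placeholders; check the replication restrictions on published tables before relying on this):

ALTER TABLE dbo.Orders
SWITCH PARTITION 1 TO dbo.Orders_Staging; -- metadata-only move of the oldest partition

-- dbo.Orders_Staging sits outside the publication, so it can be archived or truncated freely.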
Move historical data to a separate archival database. You can use techniques like ETL (Extract, Transform, Load) or BCP (Bulk Copy Program) to transfer data to the archival database.
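If you go the BCP route, the commands look roughly like this (run from a command prompt on a machine with the SQL Server client tools; server names, paths, and the filter are placeholders):

bcp "SELECT * FROM YourDatabase.dbo.YourTable WHERE Date < '2020-01-01'" queryout D:\Archive\YourTable.dat -S YourPublisherServer -T -n
bcp ArchiveDB.dbo.YourTable in D:\Archive\YourTable.dat -S YourArchiveServer -T -n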
Schedule the purge operations during off-peak hours to minimize impact on active replication.
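One way to do that is a SQL Server Agent job that calls your batched purge procedure overnight; a hedged sketch, assuming a procedure named dbo.usp_PurgeArchiveBatch exists in YourDatabase:

USE msdb;
EXEC sp_add_job         @job_name = N'Nightly archive purge';
EXEC sp_add_jobstep     @job_name = N'Nightly archive purge',
                        @step_name = N'Run batched purge',
                        @subsystem = N'TSQL',
                        @database_name = N'YourDatabase',
                        @command = N'EXEC dbo.usp_PurgeArchiveBatch;';
EXEC sp_add_jobschedule @job_name = N'Nightly archive purge',
                        @name = N'Daily 01:00',
                        @freq_type = 4,               -- daily
                        @freq_interval = 1,
                        @active_start_time = 010000;  -- 01:00:00
EXEC sp_add_jobserver   @job_name = N'Nightly archive purge', @server_name = N'(local)';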
Implement pauses between batches, or use smaller batches such as purging a few thousand rows per minute, to avoid overwhelming replication.
Regularly monitor the Merge Replication latency during purging and adjust batch sizes if necessary.
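One way to keep an eye on it from the distributor is to query the Merge Agent session history; the table and column names below follow the standard distribution database schema, but verify them on your build:

SELECT TOP (20)
       a.name          AS merge_agent,
       s.start_time,
       s.end_time,
       s.duration      AS duration_seconds,
       s.delivery_rate AS rows_per_second
FROM distribution.dbo.MSmerge_sessions AS s
JOIN distribution.dbo.MSmerge_agents   AS a ON a.id = s.agent_id
ORDER BY s.start_time DESC;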
For massive data purges, consider reinitializing the subscriptions after the purge. This allows subscribers to synchronize with the publisher from a clean state instead of replaying millions of individual deletes.
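If you do reinitialize, sp_reinitmergesubscription at the publisher is the usual route; the publication and subscriber names below are placeholders, and @upload_first = 'true' uploads any pending subscriber changes before the reinitialization:

EXEC sp_reinitmergesubscription
     @publication   = N'YourPublication',
     @subscriber    = N'YourSubscriberServer',
     @subscriber_db = N'YourSubscriberDB',
     @upload_first  = N'true';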
Implement row-level filters in replication articles to replicate only the essential data, reducing the overall data volume transferred.
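For example, when defining (or redefining) an article you can attach a static row filter via @subset_filterclause; everything below is a placeholder sketch rather than a drop-in script:

EXEC sp_addmergearticle
     @publication          = N'YourPublication',
     @article              = N'Orders',
     @source_object        = N'Orders',
     @subset_filterclause  = N'OrderDate >= ''2023-01-01''';

Keep in mind that changing the filter on an existing article generally invalidates the snapshot and can force subscriptions to be reinitialized.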
Increase the values for -UploadGenerationsPerBatch and -DownloadGenerationsPerBatch in the Merge Agent profile. (These parameters control the number of generations, i.e. groups of changes, processed per batch, which can help sync large volumes of changes faster.)
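Those settings live in the Merge Agent profile, which can be adjusted at the distributor; a hedged sketch follows (profile_id 6 is a placeholder, the default profiles may be read-only so you might need a custom profile, and parameter names may need the leading dash on some versions, so check the existing entries with sp_help_agent_parameter first):

EXEC sp_help_agent_profile @agent_type = 4;  -- 4 = Merge Agent; find the profile in use

EXEC sp_change_agent_parameter
     @profile_id      = 6,
     @parameter_name  = N'UploadGenerationsPerBatch',
     @parameter_value = N'500';
EXEC sp_change_agent_parameter
     @profile_id      = 6,
     @parameter_name  = N'DownloadGenerationsPerBatch',
     @parameter_value = N'500';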
Snapshot Replication as an Alternative
If the subscribers are primarily read-only or don’t need to synchronize with the publisher in near-real-time, consider switching from Merge Replication to Snapshot Replication. Although Snapshot Replication is more bandwidth-intensive per run, it may be more efficient for subscribers that don’t need immediate updates.
If feasible, define logical partitions so that older data doesn’t synchronize to subscribers. This way, archival data won’t sync to subscribers at all, keeping only relevant, current data in sync.
Sample Archival Workflow Using T-SQL

DECLARE @BatchSize INT = 1000;
DECLARE @RowsDeleted INT;
SET @RowsDeleted = 1;

WHILE @RowsDeleted > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (@BatchSize) FROM YourTable
    WHERE YourConditionForArchival; -- e.g., Date < '2020-01-01'

    SET @RowsDeleted = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Wait between batches
    WAITFOR DELAY '00:00:10'; -- 10 seconds delay (adjust as necessary)
END
Batch the Purge Process and execute it during off-peak hours with monitoring in place.
Archive Data Outside Replication Scope to prevent it from affecting Merge Replication performance.
Consider Partition Switching if your database tables are partitioned for efficient large data moves.
Reinitialize Replication if necessary, to bring subscribers back in sync after a massive purge.
Tune Replication Settings such as batch sizes for Merge Agent to help handle larger data volumes.