This article describes how a Copy Database operation can cause the source database to run out of transaction log space. Client applications may then encounter error 9002: "The transaction log for database 'xxxxx' is full due to 'AVAILABILITY_REPLICA'". This description is derived from an actual support case.
Each day between 04:00h and 09:00h UTC, a large data load refreshes the content of a Premium P6 database. This database is the primary of a geo-replication configuration. The geo-replication secondary runs in the same data center and is used for load balancing. Before each data load, usually starting at 00:10h UTC, a database copy is taken from the geo-replication secondary. The copy usually takes around 3 hours and finishes before the load operation starts. The copied database runs as a Premium P1; it makes it possible to compare the data before and after the data load, and it serves as a backup in case something goes wrong with the data load itself.
Details about the problem:
One day the Copy Database operation was interrupted at around 02:00h UTC by a local database reconfiguration/failover.
After the interruption, the copy had to restart from the beginning, so it had already lost almost 2 hours of its operation window.
Because of the delay, the copy process was still running when the load operation started. This made the situation worse: the load's resource consumption limited the throughput available to the copy, and the load itself added more data that had to be transferred. In addition, the target database was running at a much lower performance level than the source database. Together, these factors made the copy take much longer than usual.
The copy is performed as a backup of the source database followed by a restore to the destination, and the data is transferred over the data center network.
While the operation is running, it holds back truncation of the transaction log on its source database. The log must be retained so that data changes occurring during the copy can be applied at its end.
If log truncation is held on a geo-replication secondary, it is also held on the geo-replication primary.
This caused the transaction log on the geo-replication primary to keep growing until it ran out of space. The client applications then encountered error 9002: "The transaction log for database 'xxxxx' is full due to 'AVAILABILITY_REPLICA'".
The transaction log was finally truncated after the copy operation finished at about 13:00h UTC.
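The condition described above can be observed on the primary before it turns into error 9002. The following is a minimal sketch, not part of the original support case: it assumes that the caller runs the two queries shown (against `sys.databases` and `sys.dm_db_log_space_usage`) with a database driver of their choice and passes the results in. The `LogStatus` type, the `assess_log_risk` function, the labels it returns, and the 70 percent threshold are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

# Queries to run on the geo-replication primary. Executing them is left to
# the caller; this sketch only interprets the values they return.
LOG_REUSE_WAIT_QUERY = (
    "SELECT log_reuse_wait_desc FROM sys.databases WHERE name = DB_NAME();"
)
LOG_SPACE_QUERY = (
    "SELECT used_log_space_in_percent FROM sys.dm_db_log_space_usage;"
)


@dataclass
class LogStatus:
    """Values read from the two queries above (hypothetical container)."""
    reuse_wait: str       # log_reuse_wait_desc
    used_percent: float   # used_log_space_in_percent


def assess_log_risk(status: LogStatus, warn_percent: float = 70.0) -> str:
    """Classify the log state. The threshold is an assumption, not a
    documented limit; tune it to the workload."""
    held_by_replica = status.reuse_wait == "AVAILABILITY_REPLICA"
    if held_by_replica and status.used_percent >= warn_percent:
        return "critical"   # truncation is held and the log is filling up
    if held_by_replica:
        return "watch"      # truncation is held, but space is still available
    if status.used_percent >= warn_percent:
        return "warn"       # log is filling up for some other reason
    return "ok"
```

Polling such a check during the copy window would give an early signal that a long-running copy on the secondary is holding log truncation on the primary.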
Possible steps to avoid the issue:
Use a point-in-time restore (PITR) instead of a database copy for this daily workflow. The PITR process runs completely offline (the backup is restored from storage rather than from the live database) and therefore cannot impact the primary. With the restore process you also have more control over selecting the exact point in time of the copy.
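As an illustration of the PITR approach, the sketch below builds an Azure CLI `az sql db restore` command line for a chosen restore point. It is only a sketch under assumptions: the function name `build_pitr_command` and all resource names and dates in the usage note are hypothetical, and the command is only assembled here, not executed.

```python
from datetime import datetime, timezone


def build_pitr_command(resource_group, server, source_db, dest_db, restore_point):
    """Return the argument list for a point-in-time restore via the Azure CLI.

    restore_point must be timezone-aware; it is converted to UTC.
    """
    if restore_point.tzinfo is None:
        raise ValueError("restore_point must be timezone-aware")
    utc_point = restore_point.astimezone(timezone.utc)
    return [
        "az", "sql", "db", "restore",
        "--resource-group", resource_group,
        "--server", server,
        "--name", source_db,        # database to restore from
        "--dest-name", dest_db,     # new database created by the restore
        "--time", utc_point.strftime("%Y-%m-%dT%H:%M:%S"),
    ]
```

For example, `build_pitr_command("my-rg", "my-server", "salesdb", "salesdb_preload", datetime(2024, 1, 15, 0, 10, tzinfo=timezone.utc))` yields a command that could be passed to `subprocess.run(cmd, check=True)`. Choosing a restore point shortly before the load window gives the same "before" snapshot as the daily copy.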
Start the copy process earlier, leaving enough time for retries in case it gets interrupted. This may not always be sufficient, however, if the copy operation takes longer than expected, e.g. due to fluctuations in network speed.
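The timing budget can be worked out with simple arithmetic. The sketch below uses the figures from this case (load start at 04:00h UTC, a copy duration of about 3 hours, a restart at about 02:00h UTC); the calendar date and the function names are hypothetical, and the model assumes the worst case in which every restart costs a full copy run.

```python
from datetime import datetime, timedelta


def finish_after_restart(restart_time, copy_duration):
    """Earliest finish time if the copy restarts from the beginning."""
    return restart_time + copy_duration


def latest_safe_start(load_start, copy_duration, allowed_restarts):
    """Latest start that still finishes before the load, budgeting one
    full copy run per attempt (initial run plus each allowed restart)."""
    return load_start - copy_duration * (allowed_restarts + 1)


load_start = datetime(2024, 1, 15, 4, 0)      # 04:00h UTC, date is arbitrary
copy_duration = timedelta(hours=3)            # "around 3 hours"

# The incident: a restart at about 02:00h cannot finish before 05:00h,
# which is already inside the load window.
restarted_finish = finish_after_restart(datetime(2024, 1, 15, 2, 0), copy_duration)
overlaps_load = restarted_finish > load_start

# Budgeting for one full restart moves the start to 22:00h the previous day.
safe_start = latest_safe_start(load_start, copy_duration, allowed_restarts=1)
```

This is a best-case estimate: in the actual incident the copy overlapped with the load and finished only at about 13:00h UTC, so a fixed duration understates the risk once the two operations run concurrently.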
For the future, the plan is to remove the requirement to hold the transaction log during a database copy. It is not yet known when this change will become available.